{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../img/ods_stickers.jpg\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 随机梯度下降和独热编码"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 介绍"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "不知不觉中，大数据时代已经到来。想象一下，如果你的训练数据集为 100G 或者更大，你在训练模型时，会怎么做呢？为了解释这个问题，本节将介绍什么是随机梯度下降算法，在线学习以及独热编码和哈希技巧。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 知识点"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 随机梯度下降\n",
    "- 在线学习\n",
    "- 独热编码\n",
    "- 哈希技巧"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 随机梯度下降"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "梯度下降是一种优化算法，因为其理解起来相对比较简单，所以梯度下降往往都是许多人在学习机器学习时最先接触到的优化算法。但它只是最基本的优化算法之一，在面对复杂的模型或数据时，很难达到较好的优化效果。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "梯度下降的主要思想很简单，就是通过在下降最快的方向移动来逐步逼近某些函数的最小值。 一般情况下，增长最快的方向指的是某个函数点的偏导数所指的方向，也就是某个函数点的斜率。也就是说，如果通过向相反方向移动，也就是函数下降最快的方向，就可以以最快的速度找到函数的最小值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src='https://doc.shiyanlou.com/courses/uid214893-20190505-1557034672266' width=50%>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "梯度下降的想法就跟上图所示的滑雪运动一样。 如果你想尽可能快地到达山脚，你就需要选择最陡的下降路线。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 实验例子"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了更好的理解梯度下降算法的工作原理，现在通过一个例子来进行说明，先导入实验所需模块。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "import warnings\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "from scipy.sparse import csr_matrix\n",
    "from sklearn.datasets import fetch_20newsgroups, load_files\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import (accuracy_score, classification_report,\n",
    "                             confusion_matrix, log_loss, roc_auc_score,\n",
    "                             roc_curve)\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
    "from tqdm import tqdm_notebook\n",
    "\n",
    "%matplotlib inline\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "实验所用到的数据为 SOCR 数据集， 数据集记录的是每个人的体重和身高信息。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "导入数据集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Index</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>65.78331</td>\n",
       "      <td>112.9925</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>71.51521</td>\n",
       "      <td>136.4873</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>69.39874</td>\n",
       "      <td>153.0269</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>68.21660</td>\n",
       "      <td>142.3354</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>67.78781</td>\n",
       "      <td>144.2971</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Index    Height    Weight\n",
       "0      1  65.78331  112.9925\n",
       "1      2  71.51521  136.4873\n",
       "2      3  69.39874  153.0269\n",
       "3      4  68.21660  142.3354\n",
       "4      5  67.78781  144.2971"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_demo = pd.read_csv(\"../../data/weights_heights.csv\")  # 导入数据集\n",
    "data_demo.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了直观地看出体重与身高的关系，画出数据分布图。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0, 0.5, 'Height in inches')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEKCAYAAAAfGVI8AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAIABJREFUeJztnXuYXWV56H/v7NlJ9gTNJBKtjIQEq0kbYxIYBY1aA6eJlyakiEYe7PH2FHs5tFAMBuUhoeWUaPQBTm21HLy1Uk64OUJTRSrU00aJnTCJMUiq3BI2ouHABM1sMntm3vPH2muyZ8+67b3X2rf1/p5nnpm99rp8a823vvf73quoKoZhGEZ66Wp2AwzDMIzmYoLAMAwj5ZggMAzDSDkmCAzDMFKOCQLDMIyUY4LAMAwj5ZggMAzDSDkmCAzDMFKOCQLDMIyU093sBkTh5JNP1oULFza7GYZhGG3Fnj17nlXV+WH7tYUgWLhwIYODg81uhmEYRlshIk9G2c9UQ4ZhGCnHBIFhGEbKMUFgGIaRckwQGIZhpBwTBIZhGCmnLbyGDMNoTwaG8my/9yBPDxc4pTfHprWL2bCyr9nNMiowQWAYRiIMDOW58q79FIrjAOSHC1x5136AWISBCZn4MNWQYRiJsP3eg5NCwKVQHGf7vQfrPrcrZPLDBZQTQmZgKF/3udOICQLDMBLh6eFCVdurIUkhk0ZMNWQYPpjqoT5O6c2R9xj0T+nN1X3uJIVMGrEVgWF4YKqH+tm0djG5bGbKtlw2w6a1i+s+t58wiUPIpBETBIbhgake6mfDyj6uO38Zfb05BOjrzXHd+ctiWVX5CZnVS+azatv9LNq8k1Xb7jfBHRFTDRmGB6Z6iIcNK/sSUae55yxX3a1eMp879+QT81LqZEwQGIYHSeq300SSdpZKIbNq2/2+qzgTBMGYIDAMDzatXTzFBx7i02/70WnG6YGhPJtu30dxQgFnhr7p9n2T38d9r7aKqx0TBIbhgZfqIcmBOengq1rbVM/9b737wKQQcClOKFfe9SNAYr9XW8XVjgkCw/AhKf22F0HG6WYIgjgE03Ch6Lm9UJzw2Fb/vTZjFefS7qs5EwSG0QK0mlojqmCKcwDMDxdYuHknAL25LFvXL63qXI1exbnEvZprhlAxQWAYDcbrRW81tYafAMoPF1i17f7JWXbQADi3J8vzI96rgjCGC8VJe0K1wqDRM/E4V3PNUhFaHIFhNBC/QLXVS+YnFnxVC3NyWd/v3DZvvftAYKzFlnVLyWak5jYUJ7Qt4jbiXM01K34lsRWBiCwGdpRtOh24WlVvKH1/OfBZYL6qPptUOwwjCWpdvvu96A88coTrzl8W6Zxxqg78ziUh43ehOD7tPlzcAbBcVeO12olCO3j8xLmaa5aKMDFBoKoHgRUAIpIB8sA3Sp9PBdYAh5K6vmEkRT3L96AXPYpaI07VQdC5hmtU6QB0iTAwlJ+8nw0r+1i17X7PwbI3l+VooYh6nAfCB9NWMNLGaaRuloqwUaqhc4FHVfXJ0ufrgSvA9/9vGC1LPcv3enPk+F17690HAGdgjJpiIeg+orbHS/UzrjotL5NfSggR/0GgS+DY8THfe2mVfFBxptJIMj9TEI0SBO8HbgUQkfOAvKruCz7EMFqTepbv9b7oftcYLhS5amA/m+7YN2Vg3HTHPt+B0U9dkx8ueNosvJg9o5uMhx6pUjD6DZZBK4+MCMOl1YLXIN9K+aA2rOxj1+ZzeHzbu9m1+ZyaVyVJ5mcKQlSTnZSLyAzgaWAp8CvgAWCNqh4VkSeAfi8bgYhcDFwMsGDBgjOffPLJyl0Moyn4qTn6enPs2nxO6PH1qDP8rg3ODHrC43We25Nl6Oo109pw2Y69vrPxXDbDe87s44FHjvB0SbB44YoAr+8FeHzbu32OdPC7n4wI4x5jU28uy+yZ3aFtCrtuWhCRParaH7ZfI9xH3wk8pKq/EJFlwCJgnziziFcBD4nIG1X1mfKDVPUm4CaA/v5+UyEZLUO9OuEwW0CQoNi0djGX7tjreZyXEAA8XTi333swUC/rGrB3bT4nUGi4KqQoem2v+/J7ln6G6OFC0TdQze+6RjiNUA1dSEktpKr7VfXlqrpQVRcCTwFnVAoBw2hlkly+h+m9Xf/8eomixnJjBi4NWDmMjI6RHy7g5WS08GUnBmSv+7psx14Gn3zO81n21TiYN9Pltp1JdEUgIrOB3wU+luR1DKPRJBW4FCU4acu6pZ6zaFDP9A29HjEBc3LZ0Jm14G9HcHFXG16CYtejz3HVwH6u3bDM874UuOVBb8dBr5VCWFvbMbVDK3g9QcKCQFWPAS8L+H5hktc3jHYjiiG6MpXCnFwWEW8VULZL2Lp+6ZRtA0N5jo2OhbYlDn3srbsPc+2GZb735QoD91ruCui685dNi6v4+dGCp/orI8Kj171ryraBoTxb7z4wKezm9mTZsq66lBVJ00qJBi3FhGE0kcoZod9MfU4uy6pt90+ZObr6+8qZs+AMsBmRKdG55QKkON4Ys5tr8PXzj4fpAqdQHOfSHXvpq5ghu3mIvK5R/l0u28XxsYkpQuP5kSKb7qg+ZUWStFKiwcS9huKgv79fBwcHm90Mo4NpxhLdaxD3ItslIEwbvHsDVgKuMCg/x0mzuhke8Q/eSgJ3tj4wlPc1cgeRy2Ym7S9BHlNRierZ1QgWbd5Zs7dVVKJ6DVmuISP1NCMwaWAoz+W37QsVAgLM6O7ynMEPF4q+Sd0q9y5OKM83WAgAXHjWqZN/d9WQdqhQHOeae5xgOa8YjGrJDxe4amB/XeeIi3qDC+PEVENG6kl6iV652nBr63r5yVeiwLHRaAbTViIjwoVnnUr/afNCZ/LZjASqqp4fKU6mrAC4/LZ9kZ6dH18vGaiv3bAscL/y/1tvTxZVOFooxrZibGb9hEpsRWCkniQTfXmtNm558FBkb5h2pK83x6PXvYv+0+ZN3nsQs2d0h7qLXn7bvklhMBGDOvvW3YcDv6/8vz0/UgyMcq6FZkURe2ErAiP1JJHoy51Nep239a1y9eHes9dKy4ujhSJ7t6wJtCO4+YsgmutrGOOqk3UVvAbesLbHtWJsRv0EL2xFYKSeuBJ9uQnfFm7eyWU79tZt2HTb0Y4MDOUjr6h6Zjj3uGFln2fMg4s7+BbHp8dK1ELQzD5K29shRXZUTBAYqaeeJbrf4F/rrF9wfN7L2+GV1K3V+cSdP6I3YgT0sdFxll79bcf3f/3SQOH39HAhVpuJX5K6KKvBTkplYaohI5V4uYt6uRUGuZVWun/GpfKpTBA3+ORzkwbOduH42ATHx6LP3I+NjrPpjn1sv2A5ZyyYw65Hn/PcTwTi9njPDxemGKMhPLK501JZmCAwYqVVQuaDCIvoLNfvl/vju/sNPvkcDzxyJBbVTyVes0zXu+Wfdh/yTSzXCRTHNTTWIKn79/r/F4rjk1lQ5ybgNdRKmCAwYqOVQuaDCMtjHzTLLxTHp6REiJPKWWalUH3prPqNpIY3fv//cVVy2UxV6SnaYTJUiUUWG7FRb57+qNT7ogVFdAalQqiVvt4cI6NjnsFf7ozT/e2mVQCqSrpmxEOvj0eS24fD+p5XtHh5dHSjschio+E0ovB2HFHAQRGdcXuCCLBr8zlsWTfdCJrLZrjwrFPJZTOTAVLu/VxzzwETAjGSy0Yb6vxWXE+X7Ahhfa+VqqZVgwkCIzYaETIfx4u2ae1iJ39PGdkuYdPaxaFtrdZ/p/x8s8oGI8Fp9627D3vej1/qCKN6erJdnum5q2FOLhup7zViMpQEJgiM2GhE4e3YXrTKEb30edPaxZ4F2QFmdndVbRsYGR3jqoH9XHnX/imDu3ueelIlGOF0CYzUKQQAjpUK8HhR3veiToZct+NFm3eyatv9iea1ioIJAiM2GhEyH8eqwysNc3FcJyNFZ8/w9qGoxh3S5fmRYsenlGhlavEy8poGFMfVN56jvO9FmQw1I8lhGOY1ZMRK0iHzcSTq8ls9uKUZ4/bMsTl/e+H3/xpXJdvl1Hhwqex7lUWDvAzKrVSHwMUEgdFWRHnRwvDzDIpSmtFIOeJ4FgXFE4RNhlrRjpCYIBCRxcCOsk2nA1cDfcA6YBR4FPiwqg4n1Q6j8/B60cqDwCpdMSv39YsatZm7EUZxXJk9s5u9W9aE7+xDEkkO6yUxG4GqHlTVFaq6AjgTGAG+AdwHvE5VXw/8F3BlUm0w0kG5zhWY5orppXud2W3mMaM26p25N8KpoloapRo6F3hUVZ8Enizb/iBwQYPaYHQIlUE9x46P+RpjK3WvUctDGoYf9c7c41Bvxk2jBMH7gVs9tn+EqeqjSUTkYuBigAULFiTXMqNtGBjKs/XuA1OMuVF0+q4ReNPaxZFz5BvtS2W95jipZuYeFIXcKnUIXBJPMSEiM4CngaWq+ouy7Z8C+oHzNaQRlmLCiGMmn8tmTAgYNRNmeyqnVVJNtFKKiXcCD1UIgQ8BvwdcFCYEDAOiV7sKolAcpw1T+xstgBDN9uSy9e7pKUJaOdVEIwTBhZSphUTkHcAVwHpVHWnA9Y0OIC7XOpt2GH4IsOrV86YZcr1UTUGD+sBQ3jcWxVVTtkpEsUuigkBEZgO/C9xVtvnzwEuA+0Rkr4h8Mck2GMnSqFD5IANdLpsJLHFoGGFkRLh+4wpu+cM3TYuO95s7+E1Ogmb9bqxKq0QUu1gaaqNmGqkH9bMRzO3JsmXd0ras4mW0DjdsXOHZZweG8ly2Y6+nMMiIMKE6zRDsl+bcj7jTtJfTSjYCo0NpZMpdrzxGN2xcwdDVa9iwso8HHjkS+zWNdDC3J+s7cdl+78HAlBNeM/tq3UtbITOppZgwaqbRofJBLnet8DIZ7Ykq02oWu0TtV4XiOJffto/Lduydkm7cRYCeGRmOjU53eGhmRLGLrQiMmgnLBNrIVLut8DIZ7clwoeirq6+mX7krBK/aBwqMjk1MS3He7IhiFxMERs0Ehco3ItVuuaAZGR2bVmzGMKJSqdJ0+1Z+uOBXuqJqihPK7BndiaZprxVTDRk1ExQqv2rb/TWl2o1Sj9grwtgqehn14rp2uoO/axsotxFkRDj79Lk8dOhoTXEtRwvFuhLWJYUJAqMu/PT2tdgPBobybLpj32TRmPxwgU137Ju8jruP5QoyksJNWRJkIN716HPksl3M7ckyPFKkqxRxHIVWVWGaasiIRLX6/loqiV1zzwHPymHX3HNg8rPlCjJagUJxgheLE1y/cQWfe9/yaSpSL1rFHuCFCQIjlFr0/auXzK9qO/ird8q3m3eQ0SqUqzrLXZt7c1nm9jgBjm55y1ayB3hhqiEjlFpK6/n59dfq73/VwH6u3bDMt6iHYTQDd2ISVCyp1slLFHtZXNiKwAilFn2/32DtGuTKVxOu2imIrz94iKVXf9uEgNFS+Kk66/Waa3SBexMERii16PszAWk+yzt1ZXWxILyCcQyjWQTp/OuNum9k1D6YasiIgF+N35HRsckZSuUSNsyLorxTm/HXaCcEQlU1fhObqGqiRkfthwoCEVkF7FXVYyLyAeAM4MZS2UkjBbid3ct3f9Md+0CdYBk4Mduf25MN9e03w6/RbkRJEHfVwH7f76K6jza6wH2UFcEXgOUishy4HLgZ+AfgdxJpkdGSbFjZx/Z7D07Ls17p7gnODL9QHA8tGejWG/bL3W4YrUa5Ksg15uaHC5PVy8ImQO4qOszo67UKT9L9NIqNYKxURew84POq+rc49QSMlFHtDD4sxGb1kvkcGx2rvUGG0UB6sl3TAhvdWburCg1bBT8/4p/XqByvbLtJup9GWRH8SkSuBP4AeKuIdAFWBSSFxOm6ObcnywOPHPFcURhGkriz90wVEcEwNZlcPYGNUVKtQGML3EdZEWwEjgMfUdVngFcB2xNtldGSeCWZy2ak6mRv2S5hy7qlZiMwmsK4KtkuYWZ3df22XD9f74So1fp+qCAoDf53AjNLm54FvhF2nIgsLpWidH9eEJFLRWSeiNwnIj8t/Z5b3y0YjcJrubr9guVsfOOpVWVkHCvNwnp7bGFpNIfihDLikS7aD7fEpBsDE+QeHYVWyzkUWqpSRP4QuBiYp6qvFpHXAF9U1XMjX0QkA+SBs4A/BZ5T1W0ishmYq6qfCDreSlUmh5fBqy/ENa4y4rEWg28um2FsfJwq3kXDaAqVTg+5bCZQLZTNCLNndHO0UGROLsux0bEpKtCkyrl6EbVUZRQbwZ8CbwR2A6jqT0Xk5VW251zgUVV9UkTOA95e2v414N+AQEFgJENlJk9XX+q6gAKeYfPlx9S6RLbYAaNdqJwqF4rjvvaFjAjbL1g+5b1pZKqIWokiCI6r6qiUlkIi0k24Q0gl7wduLf39ClX9eenvZ4BXVHkuIyaCDF5+Bi3L/mkYzqSpcmXgN9NvpNG3VqIYi78nIp8EciLyu8DtwD1RLyAiM4D1peOmUHJL9RQqInKxiAyKyOCRI1aYvFaC0keHGay8vm81I5dhNAPXnbMVq43VQpQVwWbgo8B+4GPAv+AElUXlncBDqvqL0udfiMgrVfXnIvJK4JdeB6nqTcBN4NgIqrieUcJLjVOu8glzB/UyaFn2TyPtuIFd7TDTj0oUr6EJVf3fqvpeVb2g9Hc1A/OFnFALAdwNfLD09weBb1ZxLqMKwhJXebmDumQzwrHjY9NWEpvWLq65ZqthtCtun09i5l9t0ackiOI1tArYCpyGs4IQHK3O6aEnF5kNHAJOV9WjpW0vA24DFgBPAu9T1eeCzmNeQ7WxaPNOT72bAI9vezfgHyb/6xfHJvMHuccozotgKwKj0+jNZXnhxSITAcNhlDxD1eJVejVOr6I4vYa+BFwG7AGqshKq6jHgZRXb/h+OF5GRMFESV3ktb1dtu39aqLz7flQW9jaMduWGjSsAJidCYSRhH6ul6FMSRBEER1X1W4m3xIidWhNXhXV4ZbpvtWG0G7cPHuKhQ0cje8ElEQTW6HTTfvgKAhE5o/TnAyKyHbgLJ9UEAKr6UMJtM+rEnVFU68McxSBsQsBod3Y9GqiRnkJSmT8bnW7aD18bgYg8EHCcqmq8yrIAzEbQGMrtBTbjNwyHXLaLWdkMwyPF2APCWsVGEGosbgXSKAjqjUaMenzQ4G/CwDC86c1l2bp+aSyDdZKRx7EZi0Xkr4HPqOpw6fNc4HJVvar+ZhpehPn/hx17zT0Hphh7/Y6vvE7loG+2AMPwZrhQjPxOhtEK8QhRIovf6QoBAFV9HnhXck0yai1c7Q7sXsUxvI6/5p4DoYYyEwKG4U2SxeQbTRRBkBERNwU1IpLjREpqIwFq9SQIywNUfvzAUD60mpJhGMF0SsqVKO6jtwDfFZGvlD5/GCdrqJEQtXoShHXKLhEWbd45mTo6DFMLGUYwrVZXoFaipJj4NPA/gd8q/fyVqn4m6YalGa/UD1Hc18I65bgqimMziFI/4KKzF/imoDAMg8SKyTeaKKohVPVbqvrx0s+9STcq7dRauNovd1AtxZQEuOXBQ8zKRuoihtFxZLuE2TP8J0Jze7JNN/LGRRSvofOBTwMvxxkf3FxDL024bammFk8CvwCyy3bsrfr6rkrI7AhGGsmIsP29ToEZP1//LeuWNrGF8RLFRvAZYJ2q/iTpxhj14yVAKt1JXeb2ZFGl6jKThtEJ+FUZA5hQnXyPao3QbyeiCIJfmBBIhqRL2A0M5dl69wHPgT6bEbasW8r2ew+aIDBSRXnk7oprvuPZ/yvtba3g658kUQTBoIjsAAaYmmvorsRalQKqCRqrNkr46eECvT1Zjo4U8asNP3tGNxtW9tWkNjKMdsYVAgNDeV540XsStHrJ/Aa3qrlEEQQvBUaANWXbFCcJnVEjUdPPRhUYlfuF6faHC0VWbbuf3p6s2QGM1JAp85zYfu9B3/oDDzySrvK4oYJAVT/ciIakjahBY1EFRi1F5fPDBbJdVm/MSA/jqpMTqaC4m04JFItKUBrqK1T1MyLyN3jEFanqnyXasg4natBYVIFRa8ctBpVkMowOxJ1IBaVb75RAsagEOYm7BuJBnOpklT9GHUQNGvPrkKf05qbUOu2qJVjAMFLK08MFNq1d7LkizmakYwLFouK7IlDVe0q/a04nISK9wM3A63BWFR8BCsAXgVnAGPAnqvrDWq/RrkRxSRsYynumgshlM6xeMn+KTcDPDc4wjOm4E6ft710+xbNubk+WLeviSS/dTiRaj0BEvgb8u6reLCIzgB6cwvXXq+q3RORdwBWq+vag86S1HkFlEAuc6Kh+dVaDfKMNwzhBuRtp0q7czSLO4vW1NmAO8DbgQwCqOgqMiojieCIBzAGeTqoN7UZ5Z+zyGdB7Qtw+J1S5YeMKTyFiGMYJytNI11r/o1NIMpHMIuAI8BURGRKRm0VkNnApsF1EDgOfBa70OlhELhaRQREZPHKk81253BVAfriA4q/qcY3CQbaD8lxFhpF2enNZ3++eHi7UXP+jkwgVBCIyX0Q+KSI3iciX3Z8I5+4GzgC+oKorgWPAZuCPgctU9VTgMuBLXger6k2q2q+q/fPnd35wR1T3T1cA+CWYGxkdY2AoH3v7DKNdyIhMJmu8YeMK9m5Z4zspOqU3V3P9j04iimrom8C/A/8KVKNreAp4SlV3lz7fgSMI3gL8eWnb7TjG5NTj58ZWTrlXkbtkrUwh8fxIkUstWthIMeMl9Wi5WmfT2sWeieM2rV3sa2+bE7CS6DSiCIIeVf1EtSdW1WdE5LCILFbVg8C5wMPA6cDvAP8GnAP8tNpzdxoDQ3nfIjAZESZUPQ1YG1b2Wa4gw/Dgsh17uXTHXvoq3hs/g/Cm2/dNi6k5VlpdR7ETtLuxOdRrSESuBb6vqv9S9clFVuDM+GcAj+FUN1sK3IgjhF7EcR8NjEvodK+hVdvu910RfODsBVy7YZnvsYs277QqYoYRggioMk0wuKz8y+94plrp682xa/M5gef2S1MdpYZI0kT1GopiLP5z4J9FpCAiL4jIr0TkhSiNUNW9JT3/61V1g6o+r6r/oapnqupyVT0rTAikgSBd5J178oE6/7RFQBpGLbjzXdcjqPKdGvbJtxXFTtAJxuYopSpfoqpdqppT1ZeWPltRmhgJGszDOtSmtYvJZiyq2DCi4vVOBXnhhdEJxmZfQSAiS0q/z/D6aVwTOx8/DyCXoA61YWUfs2ckFg5iGB1J5TtVa51wqE+ItApBI8hfABcDn/P4TnEMvUYMuHrEy2/b5xk/ENahjoYYi7MZoThulgTDcPEqPAO1VSEL8khqF4JyDV1c+r26cc1JL26Hq6VDBWVRBEwIGEYZAp7vVK1VyDqhlKXpFFqIWjvU6iXz+fqDhxrRRMNoe5T4U0e0eylLEwQtQBQf5KB9dv7o581otmG0JZZ6ZTomCJpMlFKUVw3s55YHD03GC5TvA+FlKQ3DcPBTC6WdKLmGvhtlm1EbYT7IA0P5KUKgfJ/Lb9vH1rsPNKilhtH+JKEW6gSCSlXOwqkfcLKIzMURpuCkkLYnGRNhPsjb7z3oGzk8rmrpJQyjCjJWyc+TINXQx3BSRp+CU5rSfYIvAJ9PuF1tS7U5R8JqF7dTUIphtDpWtMkbX9WQqt6oqouAj6vq6aq6qPSzXFU7WhCU1wJete3+yGmdK2sK+IWzlxMWyNJOQSmGETe5bG0lU/zm/WYo9ibUWKyqfyMibwYWlu+vqv+QYLuaRhTjrR9B+v7KY68a2M+tuw8zrooI9GS7KBQnpmdF9AhWqWRuT5aeGd2RUlkbRjtRKE5M/p3LZpiV7fJ1jujNZTlaKHJKb47VS+Zz5558Wwd5NZJQQSAi/wi8GtjLiXoECnSkIKhmMK8kas6Rqwb2T/H7V4WR4oRnptHy2IL8cGFauupsRlA1FZLR+RSK48zs7iKXzUx5RwW4yOPd6T9tXlsHeTWSKO6j/cBva5JV7luIehJIhen7XW7dfdjz+Ft3H/ZMOV0erDIwlOeaew5MzoqK42YwNtLD0UKR6zeuiDTAt3uQVyOJIgh+DPwGkIqopaiDuRdRc474GazCDFmVQsAw2p0uoDsjjEZMg+LW5LYBPl6C3EfvwdFCvAR4WER+CBx3v1fV9ck3r/HUk0AqaoqIjIjnoB/k2uZV/MIw2p0JYHwidDfAdPxJErQi+GzDWtFC1JtAKsps5cKzTvXMDXThWaf6HnPNPQdMCBgdSdhKWMB0/AkTlH30e/WeXER6cUpVvg5ndfERVf2BiFwC/CmO8Xmnql5R77XiJOmlp2sHcL2GMiJceNapviUpB4bypg4yUsvj297d7CZ0PFG8hn7F9LrqR4FB4HJVfSzg8BuBb6vqBSIyA+gRkdXAecByVT0uIi+vse1tzbUblk0Z+N3YBa9VSDuVvDOMOHnNy2c3uwmpIIqx+AbgKeCfcFZp78dxJ30I+DLwdq+DRGQO8DbgQwCqOgqMisgfA9tU9Xhp+y/ruoMGUm3UcDXnDYpdCPNYymaEjW/wVjcZRqvT15vj2PExT++3kdGIBgSjLqKE7a1X1b9X1V+p6guqehOwVlV3AHMDjlsEHAG+IiJDInKziMwGXgu8VUR2i8j3ROQN9d9G8vhFDV81sL+mKORywhLPhXksFceVBx45UvV1DaPZuNlA/arsWXxMY4giCEZE5H0i0lX6eR/wYum7ICtPN3AG8AVVXQkcAzaXts8DzgY2AbeJTHeXEZGLRWRQRAaPHGn+IOc3WN/y4KGqUkp4ERa7EFbTmNK1LXzeaDcU592ak8t6fm8pVhpDFEFwEfAHwC+BX5T+/oCI5ID/EXDcU8BTqrq79PkOHMHwFHCXOvwQx4Ps5MqDVfUmVe1X1f758+dHvqGk8BusvdJDV6vTDyt+vWFlH+85M1wFZSkmjHYkP1zg2OgY2a6p80FzF20coYJAVR9T1XWqerKqzi/9/TNVLajqfwQc9wxwWETc/+S5wMPAALAaQEReC8wAnq37ThKmmplJtctZrxm/4LwgrrrJqpAZnUxxXDlpVjd9vTkEx25w3fnLzF20QQQFlF2hqp8Rkb/BQwWkqn8W4fyXALeUPIYeAz6MoyL6soj8GBgFPtjdMLdgAAAXTUlEQVQO6Su8As0q8/64VCM0XAN0oTg+GWhWfl5X3WQxBEanMzxSZOjqNc1uRioJ8hr6Sen3YK0nV9W9OLmKKvlAredsNOWeQr09WWZ2d8WW4bDSW6hSCLiYEDA6hWxGmD2j29NDKGwClZTXnhEcUHZP6ffXAESkR1VHGtWwVqByoH5+pEgum+H6jSsmO2DUDIeVnXj1kvmTAWXltPzSyDDqYPaMbrauX1p1Gpd60sMb4UiYVkZE3gR8CThJVReIyHLgY6r6J41oIEB/f78ODta8MKmZVdvu9zTA9vXm2LX5nMjnsTxBhuEgOJHC1c7u43wX07SqEJE9quqllZlC1ICytcDdAKq6T0TeVmf72oJ6UlKX4+V6GoSf7cEw2p1yT7hqBuA43kVbVfgTqQ6cqlYm0E/F1DbMrTMq1XTWXDbDRWcvqOr8htGKVAYH1eMOGse7GBa4mWaiCILDpVKVKiJZEfk4JwzJHU1YPeGoRO2sGRGuO9/JQWTBYUa7oxCbO2gc72JcK/xOJIpq6I9wksf1AXngOziZQzueypTUvT1ZVOGyHXvZfu/ByPrFKHWHc9nMlBclyjGG0cpkRKrS31dSqc9/z5l9PPDIkZr1+/UUnep0ohSvfxYnujiVuLrMevSLXjUOVi+ZH9ipK2sVG0a7EVZnIAiv9+3OPfm6VxW1Fp3qdIICyjwDyVwiBpR1DPUUtYfpxrGBofy0RHFeHg2b1i5m0+37KE6Y+dhoL+pRb9b7vnlRb9GpTiZoRVDur3kNsCXhtrQ0cekXB4byfPKuHzFSPJFeNz9cYNPt+0CcUHt326U79poHkdGWlKdIqWWwTUqfb/WOvQkKKPua+7eIXFr+OY1Uo1/081UeGMr7zu79ZvwmBIx2oDeXRcQJuvRKkQLVuWiaPr+xRHIfxcajyF4LfnULXOFgKh6jE5k9s5st65bS15uLJSNvXB57RjSieA0ZRNcvBuk2zU3N6FTCkiNG7fvlq+k5uSyzsl0MjxRNn58wQcbi8lrFPSLygvsVoKr60qQb12oE6RfdDuzn4eMKjyAPoEyXMG4rBqPFyGZk0nYVRKE4jgh4OQuVq3SCVKflwmS4MD23l5EMvqohVX2Jqr609NNd9vdL0igEgihXB/nhdvjK4hvlvGRmN3N7vCs1GUYzEGD7BcsjewB5CYFsl0yqdMJUpxb52xyi2giMAMJyCbm6zQ0r+9j+3uW++x0tFNmybinZjL+wMIxGckpvjg0r+9i1+Rye2PZubti4omq30JNmdU9RrVarOjWVavKYIIiBoI5aGVq/YWWf74t0Sm/OMShHWIYbRiOoNM66QuGGjStC62i7DI+cqD0QNNjHldvLqB4TBDHg11HdFLmV+k0vj4hslzAyOmZRxEbLMLcn66ub37Cyj+vOXzaZSyiI8vcjaLA3T6HmYYIgBqrtwJUvUW8uCyUf7GoJMDkATr4XN+mXYUQll82wZd3SwH3c1cHj297t27+EqauKoHel8r2wusWNI7QwTV0nF+kFbgZeh+OB9BFV/UHpu8uBzwLzS/mMfGlWYZpqqLXgxcBQnstv21dzXpZslzCju4tjo942CrcQCMDKv/xOTcLGSBcCXHT2Aq7dsCzyMV7Fl/zOk7biMM0kzsI09XAj8G1VvaBUwL6n1LhTgTXAoYSvnwh+Hbnazuy+PPUk5ypOKBPFCWbPyHgKA3cpPjCUNyFgREJhWh6sMKrJ42NpHlqPxASBiMwB3gZ8CEBVR4HR0tfXA1cA30zq+klRTRbSsJlPtZXL/BhX9RQCAqxeMn8ytYVhRKUWTx0b4NuXJG0Ei4AjwFdEZEhEbhaR2SJyHpBX1cCRSUQuFpFBERk8cqS62UmSRPV1DvKXdknaLU6BWx48xKe+sd9SWxhVYZ466SJJQdANnAF8QVVXAseArcAngavDDlbVm1S1X1X758+fn2AzqyOqr3MUgdGIl03B135gpIsPnL0gMKDRxTx10keSguAp4ClV3V36fAeOYFgE7BORJ4BXAQ+JyG8k2I5YierrHEVg+HlQGEbcdAn0nzaP7e9d7niplZjbk+UDZy8wT52Uk5iNQFWfEZHDIrJYVQ8C5wIPqeq57j4lYdAf5jXUSkStchQlja6fgc2qkhnl5LJdXHf+6xl88jlu3X24JueCCYUr79rPdecvY++WNQm00mhnkvYaugS4peQx9Bjw4YSvlzhRvSOiCgw/A9tlO/bGmvtbcAaU8oI4RnvwYul/du2GZVy7YRmLNu8M7BvZjDA2rr7poG22b1SSaBxBXLRDHIEX9fhLL9y8M/b25LIZZmW7zI20DZnbk2Xoamcmv2rb/b4rxr5SP/ObSAhw/cYV5sefEloljiDV1OpONzCUJyMSqgLwK2PZVUoF7DUjjMNd1Wg8z48UGRjKs2Fln+9qs1y376de7O3JRnZ/NtKDpZhoMaIGmfX15rh+4wpu2LhiWrbSTJdYSbkOxPU427Cyj/ec2UdGnP97RoT3nDl10uHniKCKpXo2pmErgiZTqT4aGR0LnbULsGvzOYCjJqjMVloc18AVRbZLLK6gDXFn+ANDee7ck5/8/46rcueePP2nzZuS5Ram27Iu27HX89yW6jndmCAIIOmcKF5RylEo9zzye4GDVhQnzepG1akAZcRLFJVePecG/xiVa+45MK2/uhMGFz+VkQWQpRtTDfkQJTK4XmpJMVHpeVTLCzw8UmTvljU1FRkx/OnrzfG59y2PFLRVC66A8RP+z48UQ/urpXo2vDBB4EOSZfMGhvKBnh+VuMOKV7CP14sdhis8KitPGbVTnkr5pFnJLLRdoR1V+Hv1V0v1bHhhqiEfaimbV65KmpPLIuLMvoMKdHvRm8sye2Z3FSqp6lQRlbM/t91GbfRV/I+Gq3DP7RIn2KvPJwDRpXzW7uU15IdXf7XkcEYlJgh8iBIZXE7lAF+ufy930YtS33jr+qWRaxk414weJFa5BIwimKrFz601mxHGx5VOCml7olTroRy/vuPFhJ4Y5P309xmRaeVOYaoh+NjxMU+bj+n+jSiYasiHanWpYQN8WIFuqH6ZXouNYQKmzP7jSoXtkstmePOr500rXyhAd5d0lBAAPG1Gm9YunubSG4TbN/z63Ofet3xanyivDrZr8zlsXb/UdP9GzdiKwIdqCm1ANPc79zxesz63vrFLFI+lWl3+ni5zQ4wzp5EA7zmzjwceOTJtRaBQ1cqlXfANxqrScSg/XJgUyq7nUaXKKYhq+6thlGMpJmIiivHXfbHDokK91DWV+0S5Zi7b5Tv4zu3J8usXx2KPJ0jSfbJVqRTi1TgC+OH1/zaMaomaYsJUQzER5r0TpUC360106Y69kTyWgpb9bsZKvzY9P1JMJKhsXHWaWqjTyQ8XWLXt/kk1URzBWRbtazQSUw3FROXS3M9ryN3Xq6xlmNG2coDZsLKPS30iRQvFiWmqhrjoCzBOgqMV8TMYdyrlDgG9PdnQxH5R/icW7Ws0ChMEMRLmlhek949itK3WA8RVT8QpBESc9BZuHWS/VYUS7hLZTvip9cpxZ/FRHneU/4l5/BiNwgRBgwgreh82+/PzAJkbYfYZJ6qw8i+/E3pNV2+eRDrtZvD0cCE0syc4/9coqrGwFZN5/BiNxGwEDSIsUjlo9lfpR+7aEhZt3omqk220kYQJgfJBrNVTWPTmspEG7spobDfvTyUZkdCZfJgQmNuTNUOx0VBMEDSIsEjloNnfhOo0jyI3p8xwoUgXzuDhGp/La9KW05vLThqpkxQd5YNYLSkwvGIQvOjJdvmeOyPCqlfPC7320UKRi85eEPg8vGbnfqqdcVXPey5PExIkBG7YuIKhq9eYEDAaSqKCQER6ReQOEXlERH4iIm8Ske2lzz8SkW+ISG+SbWgVworeb1jZ5zuAlx/rtbIoTig9M7pDg4u2rl86GYQURF9vjhs2rvBtT9ix5YOYl5fUB85e4JuYLZfNcFFFMfWLzl7geT9/ff7rp537ho0reGLbu3n0undxyx++afJ7P07pzdF/2jx6e7Jl5+6aIli9Zud+53Tvv7Jd15fatWvzOaHHGkajSdpGcCPwbVW9oFS3uAe4D7hSVcdE5NPAlcAnEm5H04lSw3jr+qWh+0TJgRQluChqYFs16Sf89NrlRnTXYF6cOFEzIUoAVf9p83zvJ2jwdK/tF5uxesl8j3sUtqwLTvMR9v8MchyIWs/aMBpFYgFlIjIH2Aucrj4XEZHfBy5Q1YuCzlVLQJk74OSHCzVFakY9fzVRnFGOCdsnSr3a6vIUBQetVSbSOzY6NqUQjqvvjnLtqNdMCq9n62f4rRSIUc8X9T6SrnVhGBA9oCxJQbACuAl4GFgO7AH+XFWPle1zD7BDVb8edK5qBUGQT34cA08zB7SweINq2lGLsKxnAPMTYlEG3aRYtHmnb5H3MBWaYbQ6rRBZ3A2cAXxBVVcCx4DN7pci8ilgDLjF62ARuVhEBkVk8MiRI1VdOMgnP46IzSRrFYRRrn/2opp2uIXQc9nMpPEzrABPZbKzagRfLam9kybMdmMYaSBJQfAU8JSq7i59vgNHMCAiHwJ+D7jIT22kqjepar+q9s+fP7+qC4cNLPUOPM0e0NzB2M/TpZp2NFKoteKgaxW7DCNBQaCqzwCHRcR9o84FHhaRdwBXAOtVdSSJa4cNLPUOPK0yoNXTjrAqaUkItVYcdK1il2Ek7zV0CXBLyWPoMeDDwH8CM4H7xAnKeVBV/yjOiwalAohj4GkVrw+vdmS7hJHRMRZt3hlokA7zBkpCqLVqqmSr2GWknUQFgaruBSoNFb+Z5DVh6oCThNdQqwxoXonujo2OTUb+VqaxcIlSJS0poWaDrmG0HlaPoIOI6pXj5ynj7tsKs/QkMddNIy1E9RqypHMdRFQjdtRgsk4kKPkfNH+VZxjNwHINdRBRjcetaLRtFH5eUlvvPjAlh1OYG61hdBImCDqIqAN8mj1l/FZNw4Vi02JDDKPZmGqog6jGiJ1Wo62fWswPqxJmpAETBB1GWgf4qPi5/s7KdnnWWbAIYyMNmCAwUoXfqgmmZ1pNi93EMEwQGLEQR2bVRhG0amqF9hlGozFBYNRNWD3mqPtEvVZSg7Wp1Yy0Yl5DRt1ESVwXR3K7yjKd5uJpGPFgK4IEaRVVSNJECWSLI2NrkDDpxOdqGI3CVgQJkabZa5RAtjgytjY7/bdhdComCBKimcVrGk2UQLY4oplbJf23YXQaJggSIk2z1yiRynFEM6c5NYZhJInZCBLCL4K1U2evUTxu6vXKaZX034bRaZggSIhWKV7TytRiTDcXT8OIHxMECWGz12DiiiswDKN+TBAkiM1e/TFXUMNoHRI1FotIr4jcISKPiMhPRORNIjJPRO4TkZ+Wfs9Nsg1Ga5ImY7phtDpJew3dCHxbVZcAy4GfAJuB76rqa4Dvlj4bKcNcQQ2jdUhMEIjIHOBtwJcAVHVUVYeB84CvlXb7GrAhqTYYrYu5ghpG65DkimARcAT4iogMicjNIjIbeIWq/ry0zzPAK7wOFpGLRWRQRAaPHDmSYDONZpDmKmmG0WqIqiZzYpF+4EFglaruFpEbgReAS1S1t2y/51U10E7Q39+vg4ODibTTMAyjUxGRParaH7ZfkiuCp4CnVHV36fMdwBnAL0TklQCl379MsA2GYRhGCIkJAlV9BjgsIq7S91zgYeBu4IOlbR8EvplUGwzDMIxwko4juAS4RURmAI8BH8YRPreJyEeBJ4H3JdwGwzAMI4BEBYGq7gW89FPnJnldwzAMIzqWfdQwDCPlJOY1FCcicgRHjRQ3JwPPJnDedsKegYM9B3sGLp30HE5T1flhO7WFIEgKERmM4lrVydgzcLDnYM/AJY3PwVRDhmEYKccEgWEYRspJuyC4qdkNaAHsGTjYc7Bn4JK655BqG4FhGIZhKwLDMIzUkxpBICKXicgBEfmxiNwqIrNEZJGI7BaRn4nIjlIEdEchIl8WkV+KyI/LtnkWBxKH/1V6Hj8SkTOa1/L48HkG20sFk34kIt8QkfJEiFeWnsFBEVnbnFbHj9dzKPvuchFRETm59Dk1faG0/ZJSfzggIp8p296RfaGSVAgCEekD/gzoV9XXARng/cCngetV9TeB54GPNq+VifFV4B0V2/yKA70TeE3p52LgCw1qY9J8lenP4D7gdar6euC/gCsBROS3cfrG0tIxfyciGTqDrzL9OSAipwJrgENlm1PTF0RkNU6dlOWquhT4bGl7J/eFKaRCEJToBnIi0g30AD8HzsHJigodWiRHVf8v8FzFZr/iQOcB/6AODwK9bqbYdsbrGajqd1R1rPTxQeBVpb/PA/6Pqh5X1ceBnwFvbFhjE8SnLwBcD1wBlBsMU9MXgD8Gtqnq8dI+bkbkju0LlaRCEKhqHkfKH8IRAEeBPcBw2WDwFJCWqih+xYH6gMNl+6XlmXwE+Fbp71Q9AxE5D8ir6r6Kr9L0HF4LvLWkJv6eiLyhtD01zyDp7KMtQUkHfh5O1bRh4HY8lshpRFVVRFLrOiYinwLGgFua3ZZGIyI9wCdx1EJpphuYB5wNvAEnO/LpzW1SY0nFigD4b8DjqnpEVYvAXcAqnOWuKwxfBeSb1cAG41ccKA+cWrZfRz8TEfkQ8HvARXrCjzpNz+DVOJOjfSLyBM69PiQiv0G6nsNTwF0lNdgPgQmcfEOpeQZpEQSHgLNFpEdEhBNFch4ALijtk6YiOX7Fge4G/nvJY+Rs4GiZCqmjEJF34OjF16vqSNlXdwPvF5GZIrIIx1j6w2a0MWlUdb+qvlxVF6rqQpwB8YxSUanU9AVgAFgNICKvBWbgJJ1LTV9AVVPxA1wDPAL8GPhHYCZwOs4/9mc46qKZzW5nAvd9K45dpIjzon8UeBmOt9BPgX8F5pX2FeBvgUeB/TheVk2/h4Sewc9w9L97Sz9fLNv/U6VncBB4Z7Pbn+RzqPj+CeDkFPaFGcDXS2PDQ8A5nd4XKn8sstgwDCPlpEU1ZBiGYfhggsAwDCPlmCAwDMNIOSYIDMMwUo4JAsMwjJRjgsDoCETkehG5tOzzvSJyc9nnz4nIX4Sc4/sRrvOEm6GzYvvbReTNPsesF5HNXt8FXOfXZef952qONYxqMUFgdAq7gDcDiEgXTmTo0rLv3wwEDvSq6jmQR+Tt7vU9znu3qm6r49yGkSgmCIxO4fvAm0p/L8UJDvqViMwVkZnAb+EECyEim0TkP0t59q9xT1A2C+8Skb8r5ae/T0T+RUQuKLvWJSLykIjsF5ElIrIQ+CPgMhHZKyJvLW+YiHxIRD5f+vurpTz/3xeRxyrO68dLRWRnKSf+F0uCzjBiwzqU0RGo6tPAmIgswJmZ/wDYjSMc+oH9qjoqImtwUgW8EVgBnCkib6s43fnAQuC3gT/ghIBxeVZVz8DJ0f9xVX0C+CJObYsVqvrvIc19JfAWnDxHUVYKbwQuKbXn1aX2GUZsmCAwOonv4wgBVxD8oOzzrtI+a0o/QzgrhCU4gqGctwC3q+qEOnl3Hqj4/q7S7z04AqNaBkrnfpgTKcCD+KGqPqaq4zgpEt5SwzUNw5dUpKE2UoNrJ1iGoxo6DFwOvAB8pbSPANep6t/XcZ3jpd/j1PYOHS/7WyLsX5kHxvLCGLFiKwKjk/g+jrrlOVUdV9XngF4c1Y5rKL4X+IiInAROGVMReXnFeXYB7ynZCl6BYwgO41fAS2K4By/eKE597S5gI/AfCV3HSCkmCIxOYj+Ot9CDFduOquqz4JSoBP4J+IGI7McpVVo5gN+Jk5nyYZyslA/hVLUL4h7g972MxTHwn8DngZ8AjwPfiPn8Rsqx7KOG4YGInKSqvxaRl+GkKl9VshcYRsdhNgLD8OafRaQXJ1f9X5kQMDoZWxEYhmGkHLMRGIZhpBwTBIZhGCnHBIFhGEbKMUFgGIaRckwQGIZhpBwTBIZhGCnn/wPj0SsP1ELUmwAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "plt.scatter(data_demo[\"Weight\"], data_demo[\"Height\"])\n",
    "plt.xlabel(\"Weight in lb\")\n",
    "plt.ylabel(\"Height in inches\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这个数据集中，总共含有 $n$ 个样本，向量 $x$ 表示样本中每个人的重量，$y$ 则表示每个人的身高。假设 $x$ 与 $y$ 线性相关，则可以定义出一元线性回归模型："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " $$y_i = w_0 + w_1 x_i$$ "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "其中 $y_i$ 是 $i$ 身高值，$x_i$ 是 $i$ 体重值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "模型训练的目标是：要找到一组权重值 $w_0$ 和 $w_1$ ，使得通过回归模型 $y_i = w_0 + w_1 x_i$ 预测出的身高与真实身高的平方差达到最小。用公式描述如下所示，下式中的 $SE(w_0，w_1)$ 也称为损失函数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$SE(w_0, w_1) = \\frac{1}{2}\\sum_{i=1}^{n}(y_i - (w_0 + w_1x_{i}))^2 \\rightarrow min_{w_0, w_1}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在最小化损失函数的过程中。使用梯度下降算法来进行优化。利用 $SE(w_0，w_1)$ 对权重 $w_0$ 和 $w_1$ 求偏导数，然后通过下面所示的更新公式来对权值进行更新。其中，$\\eta$ 为学习率："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$w_0^{(t+1)} = w_0^{(t)} -\\eta \\frac{\\partial SE}{\\partial w_0} |_{t}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$w_1^{(t+1)} = w_1^{(t)} -\\eta \\frac{\\partial SE}{\\partial w_1} |_{t} $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "计算损失函数对权值的偏导数，将得到以下结果："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$ w_0^{(t+1)} = w_0^{(t)} + \\eta \\sum_{i=1}^{n}(y_i - w_0^{(t)} - w_1^{(t)}x_i)$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$ w_1^{(t+1)} = w_1^{(t)} + \\eta \\sum_{i=1}^{n}(y_i - w_0^{(t)} - w_1^{(t)}x_i)x_i$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "关于梯度下降的数学运算过程在 [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> 《深度学习》</i>](http://www.deeplearningbook.org/contents/numerical.html) 中的数值计算章节也得到了非常详尽的介绍。 \n",
    "这里先不讨论局部最小值，鞍点，选择学习率和其他内容的问题。如果数据量不大的话，这个优化过程当然可以运行。但是，当训练样本很大时会发生什么呢？ "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "显然，梯度下降存在一个问题，即梯度计算需要用到训练集中的每个样本。换句话说，该算法需要大量迭代才能找到最小值，并且每次迭代都需要使用训练样本的全部数据来进行运算。当训练数据集非常庞大时，则其将需要发费巨大的计算才能完成一次更新迭代。要训练一个模型，就要付出巨大的时间代价。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width='400px;' src=\"https://doc.shiyanlou.com/courses/uid214893-20190505-1557033319146\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了解决上述梯度下降存在的问题， 随机梯度下降算法被提出。相比于梯度下降算法，随机梯度下降每次迭代仅用一些小样本来进行运算，然后迭代更新权重，也就是每次迭代只从训练样本里抽取一部分数据，而不是所有的数据。这也极大的提高了计算效率，因为每次迭代的计算样本变少了。如果每次只取一个样本，则权重更新可表达为下式："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$w_0^{(t+1)} = w_0^{(t)} + \\eta (y_i - w_0^{(t)} - w_1^{(t)}x_i)$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$ w_1^{(t+1)} = w_1^{(t)} + \\eta (y_i - w_0^{(t)} - w_1^{(t)}x_i)x_i $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然，随机梯度下降算法也带来了一个问题。就是随机梯度下降并不能保证在每次迭代中都会朝着最佳的方向前进。因为每次迭代取的只是一小批数据，而这一小批数据并不一定等同于整体数据，通过这小批数据所计算得到的梯度方向不一定为全局的最佳方向。因此，可能需要更多的迭代才能收敛。吴恩达在他的 [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> 机器学习课程</i>](https://www.coursera.org/learn/machine-learning) 中很好地说明了这一点。 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src='https://doc.shiyanlou.com/courses/uid214893-20190505-1557033352168'>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上图是某函数的等值线图， $\\theta_0$ 和 $\\theta_1$ 对应于 $w_0$ 和 $w_1$ 。优化过程是找到此函数的全局最小值。 在随机梯度下降方法中，随着迭代次数的增加，权重的更新方向会更难预测，如图中的紫线所示。但是，不论是随机梯度下降还是梯度下降算法，最终结果都会收敛于同一个全局最小值点。而随机梯度下降算法则要快得多。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 在线学习方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "随机梯度下降为训练具有高达数百 GB 的大量数据的分类器和回归器提供了实现途径。因为每次迭代只需要拿取小批量的数据，而不是全部数据，因此大大提高训练速度。但其仍然存在一个问题。如果训练数据为 100G 或者更大，对于现在的普通电脑来说，一次读取全部数据到内存是不可能的，会出现内存爆满的情况。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为解决这一问题，在线学习方法被提出，在线学习的思想是将训练数据集 $(X,y)$ 存储在电脑的硬盘中而不将其加载到运行内存中，然后在训练模型时逐个读取，并更新模型的权重："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$w_0^{(t+1)} = w_0^{(t)} + \\eta (y_i - w_0^{(t)} - w_1^{(t)}x_i)$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$w_1^{(t+1)} = w_1^{(t)} + \\eta (y_i - w_0^{(t)} - w_1^{(t)}x_i)x_i$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里我们不对随机梯度下降算法原理进行深入讨论，如果你感兴趣可以参考 [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> 凸优化</i>](https://www.amazon.com/Convex-Optimization-Stephen-Boyd/dp/0521833787) 这本书。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在 scikit-learn 中，使用随机梯度下降算法来进行优化的分类器和回归器在 `sklearn.linear_model` 中，并命名为 `SGDClassifier` 和 `SGDRegressor`。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 类别型特征处理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "目前，许多分类和回归算法是在欧几里德空间中操作的。这意味着，输入数据特征要用数值表示。 但是，在实际数据中，往往包含离散的类别特征，例如：是/否或 1 月/ 2 月/ ... / 12 月。如果将这些类别型特征输入都模型中，模型可能无法运行。 那应该如何去处理这种类别型的数据呢？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了解释说明这个问题。选择 UCI 的 [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> bank marketing</i>](https://archive.ics.uci.edu/ml/datasets/bank+marketing) 数据集来进行实验，因为该数据集中大部分的特征均为类别型特征。先读取数据集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>job</th>\n",
       "      <th>marital</th>\n",
       "      <th>education</th>\n",
       "      <th>default</th>\n",
       "      <th>housing</th>\n",
       "      <th>loan</th>\n",
       "      <th>contact</th>\n",
       "      <th>month</th>\n",
       "      <th>day_of_week</th>\n",
       "      <th>duration</th>\n",
       "      <th>campaign</th>\n",
       "      <th>pdays</th>\n",
       "      <th>previous</th>\n",
       "      <th>poutcome</th>\n",
       "      <th>emp.var.rate</th>\n",
       "      <th>cons.price.idx</th>\n",
       "      <th>cons.conf.idx</th>\n",
       "      <th>euribor3m</th>\n",
       "      <th>nr.employed</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>26</td>\n",
       "      <td>student</td>\n",
       "      <td>single</td>\n",
       "      <td>high.school</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>telephone</td>\n",
       "      <td>jun</td>\n",
       "      <td>mon</td>\n",
       "      <td>901</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.4</td>\n",
       "      <td>94.465</td>\n",
       "      <td>-41.8</td>\n",
       "      <td>4.961</td>\n",
       "      <td>5228.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>46</td>\n",
       "      <td>admin.</td>\n",
       "      <td>married</td>\n",
       "      <td>university.degree</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>cellular</td>\n",
       "      <td>aug</td>\n",
       "      <td>tue</td>\n",
       "      <td>208</td>\n",
       "      <td>2</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.4</td>\n",
       "      <td>93.444</td>\n",
       "      <td>-36.1</td>\n",
       "      <td>4.963</td>\n",
       "      <td>5228.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>49</td>\n",
       "      <td>blue-collar</td>\n",
       "      <td>married</td>\n",
       "      <td>basic.4y</td>\n",
       "      <td>unknown</td>\n",
       "      <td>yes</td>\n",
       "      <td>yes</td>\n",
       "      <td>telephone</td>\n",
       "      <td>jun</td>\n",
       "      <td>tue</td>\n",
       "      <td>131</td>\n",
       "      <td>5</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.4</td>\n",
       "      <td>94.465</td>\n",
       "      <td>-41.8</td>\n",
       "      <td>4.864</td>\n",
       "      <td>5228.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>31</td>\n",
       "      <td>technician</td>\n",
       "      <td>married</td>\n",
       "      <td>university.degree</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>cellular</td>\n",
       "      <td>jul</td>\n",
       "      <td>tue</td>\n",
       "      <td>404</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>-2.9</td>\n",
       "      <td>92.469</td>\n",
       "      <td>-33.6</td>\n",
       "      <td>1.044</td>\n",
       "      <td>5076.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>42</td>\n",
       "      <td>housemaid</td>\n",
       "      <td>married</td>\n",
       "      <td>university.degree</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>telephone</td>\n",
       "      <td>nov</td>\n",
       "      <td>mon</td>\n",
       "      <td>85</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>-0.1</td>\n",
       "      <td>93.200</td>\n",
       "      <td>-42.0</td>\n",
       "      <td>4.191</td>\n",
       "      <td>5195.8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age          job  marital     ...     cons.conf.idx euribor3m nr.employed\n",
       "0   26      student   single     ...             -41.8     4.961      5228.1\n",
       "1   46       admin.  married     ...             -36.1     4.963      5228.1\n",
       "2   49  blue-collar  married     ...             -41.8     4.864      5228.1\n",
       "3   31   technician  married     ...             -33.6     1.044      5076.2\n",
       "4   42    housemaid  married     ...             -42.0     4.191      5195.8\n",
       "\n",
       "[5 rows x 20 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"../../data/bank_train.csv\")\n",
    "labels = pd.read_csv(\"../../data/bank_train_target.csv\", header=None)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从上表中，可以看到大多数特征都没有用数字表示。也就是说，不能将这些数据直接输入大多数机器学习模型。因此，要将类别型数据改为数值型数据。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "先来分析 education 这个特征："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x7f1cfb4f23c8>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAckAAAD8CAYAAAAc/1/bAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAHttJREFUeJzt3XmYXVWd7vHvmzAECCZMchktUAQShIAFQgRaI07AxYEoKLYgerkgLWK3Q3iwEWlbcbjaRiaDMigINgiaRhSVSQSBVAmZCdCAglcZVGZFCG//sVfBoTg7VZUazqnU+3me89Taa6299m+fc5JfrbV3nSPbRERExIuNa3UAERER7SpJMiIiokaSZERERI0kyYiIiBpJkhERETWSJCMiImokSUZERNRIkoyIiKiRJBkREVFjtVYHEIOz4YYbuqOjo9VhRESMKt3d3Q/Z3qivfkmSo1xHRwddXV2tDiMiYlSR9Nv+9Mtya0RERI0kyYiIiBpJkhERETWSJCMiImokSUZERNRIkoyIiKiRJBkREVEjSTIiIqLGmE+Skm4oPzskLSrl10m6rJQPkDSrlN8uacoQHnuapH2HaryIiBhaYz5J2p7eR/tc2yeXzbcDA0qSklb0qUbTgCTJiIg2NeaTpKTH+2g/TNIpkqYDBwBflnSrpJeXx08ldUu6TtJ2ZZ9zJJ0h6SbgS5J2k/RrSbdIukHStpLWAE4CDirjHSRpHUlnSbq59H3bsD8BERFRK5/d2k+2b5A0F7jM9sUAkq4EjrR9h6TXAKcBM8oumwPTbS+X9BJgL9vPSNoH+LztAyWdAHTa/qcy3ueBq2wfLmkycLOkX9h+YoRPNyIiSJJcaZImAtOBiyT1VK/Z0OUi28tLeRJwrqRtAAOr1wz7JuAASR8v2xOALYGlvY59BHAEwJZbbjnIM4mIiDpJkitvHPCw7Wk17Y2zv38Drrb9DkkdwDU1+wg40PayFR3Y9hxgDkBnZ6cHEHNERAzAmL8mOUCPAesC2H4UuFvSuwBU2almv0nA70v5sGbjFVcAH1GZmkraeehCj4iIgUqSHJgLgU+Um2peDhwCfFDSfGAxUHejzZeAL0i6hRfO3q8GpvTcuEM141wdWCBpcdmOiIgWkZ3VutGss7PT+dLliIiBkdRtu7OvfplJRkRE1EiSjIiIqJEkGRERUSNJMiIiokaSZERERI0kyYiIiBpJkhERETWSJCMiImokSUZERNRIkoyIiKiRJBkREVEjSTIiIqJGkmRERESNJMmIiIgaSZIRERE1Vuu7S7Szv//+ce6bdV2rwxiQzU/eq9UhRET0S2aSERERNZIkIyIiaiRJ9kHSYZJOaXUcEREx8pIkIyIiaoy5JCmpQ9Kihu2PSzpR0jWSvijpZkm3S3rR3SWS9pP0a0kbSjpH0mxJN0i6S9LM0keSvixpkaSFkg4q9adKOqCUL5V0VikfLunfS1xLJZ0pabGkn0laa2SelYiIaGbMJck+rGZ7N+BY4DONDZLeAcwC9rX9UKneBNgT2B84udS9E5gG7ATsA3xZ0ibAdUBP4t0MmFLKewG/LOVtgFNtTwUeBg4c0rOLiIgBSZJ8oUvKz26go6F+BvApYD/bf2mo/6HtZ20vATYudXsCF9hebvt+4FpgV0qSlDQFWALcX5LnHsANZd+7bd9aE8NzJB0hqUtS15+ffHjlzzYiIlZoLCbJZ3jheU9oKD9Vfi7nhX9D+t/AusAre431VENZKzqo7d8Dk4G3UM0crwPeDTxu+7Em4/WOoXGsObY7bXeuv/bkFR02IiIGYSwmyfuBl0raQNKaVEulffkt1dLndyRN7aPvdcBBksZL2gjYG7i5tN1ItZTbkyQ/Xn5GREQbGnNJ0vbTwElUievnwG393O824BDgIkkvX0HXS4EFwHzgKuCTtv9Y2q6juu55J/AbYH2SJCMi2pZstzqGGIQdN9nOlx96ZqvDGJB8LF1EtJqkbtudffUbczPJiIiI/kqSjIiIqJFvARnl1thsYpYvIyKGSWaSERERNZIkIyIiaiRJRkRE1EiSjIiIqJEkGRERUSNJMiIiokaSZERERI0kyYiIiBpJkhERETWSJCMiImokSUZERNRIkoyIiKiRJBkREVEj3wIyyt1/1538v4P2b3UYI+Zfvn9Zq0OIiDEkM8mIiIgaSZIRERE1xnySlNQhadEgxzhA0qwB9Jekf5d0u6Slko4ZzPEjImJ45JrkELA9F5g7gF0OA7YAtrP9rKSXDktgERExKGN+JlmsJun8Mqu7WNLakk6QNE/SIklzJAlA0jGSlkhaIOnCUneYpFNKeWNJl0qaXx7TmxzvKOAk288C2H5A0jhJd0jaqIwzTtKdPdsRETHykiQr2wKn2d4eeBT4MHCK7V1t7wCsBfTcQjoL2Nn2jsCRTcaaDVxreydgF2Bxkz4vBw6S1CXpJ5K2KQnzPOCQ0mcfYL7tB3vvLOmIsm/XE0/9faVPOiIiVixJsnKv7etL+TxgT+D1km6StBCYAUwt7QuA8yW9D3imyVgzgNMBbC+3/UiTPmsCf7PdCZwJnFXqzwLeX8qHA2c3C9b2HNudtjvXWXONgZxnREQMQJJkxU22TwNm2n4VVSKbUNr2A06lmiXOk7Qy13XvAy4p5UuBHQFs3wvcL2kGsBvwk5UYOyIihkiSZGVLSXuU8nuBX5XyQ5ImAjOhuk4IbGH7auBTwCRgYq+xrqS65oik8ZImNTneD4HXl/I/ALc3tH2LajZ7ke3lgzqriIgYlCTJyjLgaElLgfWolkvPBBYBVwDzSr/xwHllCfYWYLbth3uN9VGqpdqFQDcwBUDS5ZI2LX1OBg4sfb4AfKhh/7lUibfpUmtERIwc2b1XGqOVJHUCX7O9V3/6b7H+ZB/7xj2HOar2kY+li4ihIKm73BeyQvk7yTZSPpDgKJ6/wzUiIlooM8lRrrOz011dXa0OIyJiVOnvTDLXJCMiImokSUZERNRIkoyIiKiRJBkREVEjSTIiIqJGkmRERESNJMmIiIgaSZIRERE1kiQjIiJqJElGRETUSJKMiIiokSQZERFRI0kyIiKiRr4qa5R74LePceqRV7U6jBiAo8+Y0eoQIqKfMpOMiIiokSQZERFRY8wnSUkdkhYNcowDJM1aif1mS3p8MMeOiIjhk2uSQ8D2XGDuQPaR1AmsNzwRRUTEUBjzM8liNUnnS1oq6WJJa0s6QdI8SYskzZEkAEnHSFoiaYGkC0vdYZJOKeWNJV0qaX55TO99MEnjgS8Dn2yoW1fS3ZJWL9svadyOiIiRlyRZ2RY4zfb2wKPAh4FTbO9qewdgLWD/0ncWsLPtHYEjm4w1G7jW9k7ALsDiJn3+CZhr+w89FbYfA64B9itVBwOX2H66986SjpDUJanr8b89PPCzjYiIfkmSrNxr+/pSPg/YE3i9pJskLQRmAFNL+wLgfEnvA55pMtYM4HQA28ttP9LYKGlT4F3AN5rs+y3gA6X8AeDsZsHanmO703bnxAmT+3uOERExQEmSFTfZPg2YaftVwJnAhNK2H3Aq1SxxnqSBXtfdGXgFcKeke4C1Jd0JUBJ1h6TXAeNtD+qGooiIGJwkycqWkvYo5fcCvyrlhyRNBGYCSBoHbGH7auBTwCRgYq+xrgSOKv3HS5rU2Gj7x7b/l+0O2x3Ak7Zf0dDlO8D3qJlFRkTEyEmSrCwDjpa0lOqO09OpZo+LgCuAeaXfeOC8sgR7CzDbdu+Lgh+lWqpdCHQDUwAkXV6WWvtyfonhgsGdUkREDNaY/xMQ2/cA2zVp+nR59LZnkzHOAc4p5fuBtzXps2/N8XvPRPcELm6SfCMiYoSN+STZTiR9A3gr0DShRkTEyJLd+56VGE06Ozvd1dXV6jAiIkYVSd22O/vql2uSERERNZIkIyIiaiRJRkRE1EiSjIiIqJEkGRERUSNJMiIiokaSZERERI0kyYiIiBpJkhERETWSJCMiImokSUZERNRIkoyIiKiRJBkREVEjX5U1yv1t0WKWbrd9q8OINrb9bUtbHULEqJWZZERERI0kyYiIiBrDniQlbSfpVkm3SHr5EIx3gKRZQxFbr3EfH+oxIyJidBuSa5KSxtteXtP8duBi258bimPZngvMHYqxWkGSANl+ttWxRETEivU5k5TUIek2SedLWirpYklrS7pH0hcl/QZ4l6Rpkm6UtEDSpZLWk7QvcCxwlKSry3jvk3RzmV1+U9L48jhH0iJJCyV9rPQ9RtKSMuaFpe4wSac0xHZVab9S0pal/hxJsyXdIOkuSTNL/cTS7zflOG/rx/m/pfSfL+nKUre+pB+W494oacdSf6Kkjzfsu6jE2CFpmaTvAIuALWrO9+WSfiqpW9J1krbr9ysZERFDrr8zyW2BD9q+XtJZwIdL/Z9s7wIgaQHwEdvXSjoJ+IztYyWdATxu+yuStgcOAl5r+2lJpwGHAIuBzWzvUMaaXMafBWxl+6mGukbfAM61fa6kw4HZVDNXgE2APYHtqGaeFwN/A95h+1FJGwI3Sppr281OWtJGwJnA3rbvlrR+afoscIvtt0uaAXwHmNbHc7gNcKjtGyW9uuZ85wBH2r5D0muA04AZfYwbERHDpL/XJO+1fX0pn0eVfAC+DyBpEjDZ9rWl/lxg7ybjvAF4NTBP0q1le2vgLmBrSd+Q9Bbg0dJ/AXC+pPcBzzQZbw/ge6X83Ya4AH5o+1nbS4CNS52Az5eE/gtgs4a2ZnYHfmn7bgDbfy71e5bjYfsqYANJL1nBOAC/tX1jKb/ofCVNBKYDF5Xn5ptUif5FJB0hqUtS15+XN3taIiJiKPR3Jtl7ptWz/cQAjyeqmd9xL2qQdgLeDBwJvBs4HNiPKtn+b+B4Sa8awLGe6nVcqGatGwGvLjPZe4AJAzyHFXmGF/7i0Tj2c8+V7b80Od9jgYdt9zUjxfYcqlknO0xYq+ksOCIiBq+/M8ktJe1Ryu8FftXYaPsR4C+S9ipV/whcy4tdCcyU9FJ47trey8rS5zjbPwA+DewiaRywhe2rgU8Bk4CJvca7ATi4lA8BruvjPCYBD5QE+XrgZX30vxHYW9JWPfGW+uvK8ZD0OuAh248C9wA9y8+7AFs1G7TZ+Zb975b0rtJHJZFGRESL9HcmuQw4ulyPXAKcDnykV59DgTMkrU21nPiB3oPYXiLp08DPShJ8Gjga+CtwdqkDOA4YD5xXlnIFzLb9sKTGIT9S9vsE8GCzY/ZyPvBfkhYCXcBtzTpJutX2NNsPSjoCuKTE9gDwRuBE4KyybPtkOXeAHwDvl7QYuAm4vSaOzZqcL1SJ9/TyHK0OXAjM7+OcIiJimKjmnpXnO0gdwGU9N5lEe9lhwlq+qKOj1WFEG8vH0kW8mKRu25199csn7kRERNToc7nV9j1AZpFtasIOU9m+q6vVYURErJIyk4yIiKiRJBkREVEjSTIiIqJGkmRERESNJMmIiIgaSZIRERE1kiQjIiJqJElGRETUSJKMiIiokSQZERFRI0kyIiKiRpJkREREjSTJiIiIGv390uVoU4v/tJhXnfuqVocRY9DCQxe2OoSIYZeZZERERI0kyYiIiBpjPklK6pC0aJBjHCBp1gD6z5D0G0mLJJ0rKcveERFtaMwnyaFge67tk/vTV9I44FzgYNs7AL8FDh3O+CIiYuUkSVZWk3S+pKWSLpa0tqQTJM0rs705kgQg6RhJSyQtkHRhqTtM0imlvLGkSyXNL4/pvY61AfB327eX7Z8DB0oaJ+kOSRuVccZJurNnOyIiRl6SZGVb4DTb2wOPAh8GTrG9a5ntrQXsX/rOAna2vSNwZJOxZgPX2t4J2AVY3Kv9Iaqk3Fm2ZwJb2H4WOA84pNTvA8y3/eCQnGFERAxYkmTlXtvXl/J5wJ7A6yXdJGkhMAOYWtoXAOdLeh/wTJOxZgCnA9hebvuRxkbbBg4GvibpZuAxYHlpPgt4fykfDpzdLFhJR0jqktS1/LHlzbpERMQQSJKsuMn2acBM268CzgQmlLb9gFOpZonzVuamG9u/tr2X7d2AXwK3l/p7gfslzQB2A35Ss/8c2522O8evO36gh4+IiH5KkqxsKWmPUn4v8KtSfkjSRKol0Z6bbrawfTXwKWASMLHXWFcCR5X+4yVN6n0wSS8tP9cs45zR0PwtqtnsRbYzTYyIaKEkycoy4GhJS4H1qJZLzwQWAVcA80q/8cB5ZQn2FmC27Yd7jfVRqqXahUA3MAVA0uWSNi19PlGOtQD4L9tXNew/lyrxNl1qjYiIkaPqElm0i3JDz9ds79Wf/mtttZZfceIrhjmqiBfLx9LFaCap23ZnX/3yR+xtpHwgwVE8f4drRES0UJZb24jtk22/zPav+u4dERHDLTPJUW7qBlPpOrSr1WFERKySMpOMiIiokSQZERFRI0kyIiKiRpJkREREjSTJiIiIGkmSERERNZIkIyIiaiRJRkRE1EiSjIiIqJEkGRERUSNJMiIiokaSZERERI18wPlo9/9vgRMntTqKiBgJJz7S6gjGnMwkIyIiaiRJRkRE1Bj1SVJSh6RFTepPkrRPH/ueKOnjwx3LSo51jqSZQzFWRESsnFX2mqTtE1odQ0REjG6jfiZZjJd0pqTFkn4maa3GmZikfSXdJqlb0mxJlzXsO0XSNZLuknRMs8ElnSxpiaQFkr5S6jaWdKmk+eUxvS6W0n+apBvLGJdKWm9F9RER0XqrSpLcBjjV9lTgYeDAngZJE4BvAm+1/Wpgo177bge8GdgN+Iyk1RsbJW0AvAOYantH4HOlaTZwre2dgF2AxX3E8h3gU2WMhcBn+qiPiIgWW1WS5N22by3lbqCjoW074C7bd5ftC3rt+2PbT9l+CHgA2LhX+yPA34BvS3on8GSpnwGcDmB7ue2ee7NfFIukScBk29eW+nOBvevq+zpZSUdI6pLU9eCT7qt7RESspFUlST7VUF7OwK61rnBf289QzTIvBvYHfjqMsfSL7Tm2O213brS2hnr4iIgoVpUkuSLLgK0ldZTtgways6SJwCTblwMfA3YqTVcCR5U+48ussKkyy/yLpL1K1T9SLdU2rR9IfBERMXxW2btbe9j+q6QPAz+V9AQwrz/7Sboc+BBg4Efl2qaAfy5dPgrMkfRBqhnjUcAfVjDkocAZktYG7gI+0Ed9RES0mOxV/5qWpIm2H5ck4FTgDttfa3VcQ6Fz0/HuOmJiq8OIiJGQj6UbMpK6bXf21W8sLLcC/B9Jt1LdgTqJ6m7XiIiIFVrll1sByqxxlZg5RkTEyBkTSXKVtunOcGJXq6OIiFgljZXl1oiIiAFLkoyIiKiRJBkREVEjSTIiIqJGkmRERESNJMmIiIgaSZIRERE1kiQjIiJqJElGRETUSJKMiIiokSQZERFRI0kyIiKiRj7gfJRb+PtH6Jj141aHERExou45eb8ROU5mkhERETWSJCMiImokSUZERNRoSZKU1Clp9jAf44bys0PSewc51nmS3j40kUVExGjRkiRpu8v2MYMdR1LtjUe2p5diBzCoJDlUVhRvRES0nyFJkmW2tqhh++OSTpR0jaQvSrpZ0u2S9irtr5N0maRxku6RNLlh3zskbSxpI0k/kDSvPF5b2k+U9F1J1wPflTS1jH+rpAWStin9Hi9DngzsVdo/JumXkqY1HO9XknbqdT7jJJ0m6TZJPwc2bGjbVdK1krol/UTSxqV+93L8WyV9RdKtpf5Dkn4o6WrgilI3q8S8QNIJDWMf2nAup0nKcnhERAuNxH/Cq9neDTgW+Exjg+1ngR8B7wCQ9Brgt7bvB74OfM32rsCBwLcadp0C7GP7PcCRwNdtTwM6gft6HX8WcJ3taba/BnwbOKwc75XABNvze+0zE9iqHOcDwPTSf80S14G2Xw2cB/xb2eds4EMljt52Bt5p+w2S9gW2BF4DTAOmS5ouaYfyPEwvY6wGHNzsCZV0hKQuSV3Ln3ykWZeIiBgCI7H8d0n52U219Nnb94ETqJLMwWUbYB9giqSefi+RNLGU59r+ayn/Gjhe0ubAJbbv6COei4B/lfQJ4HDgnCZ99gYuKEn8PknXlPrtganAL0pc40v7hsAatm8u/b5X4u/xM9t/KeU3AW8FbinbE4FXApOBXYGuMvZawL3NTsD2HGAOwJqbbOM+zjciIlbSUCXJZ3jhrHRCQ/mp8nN5zfF+DbxC0kbA24HPlfpxwO62/9bYuSSQJ3q2bX9P0k3AfsDlkv6v7avqArX9ZFlCfRvwbuDVfZ/e84cHFtjeq1dMG9b07/FEQ1nA52x/u9cYHwPOsv2vA4gnIiKG0VAtt94PvFTSBmVJcv/+7mjbwKXAV4Gltv9Umn4GfKSnX+N1xEaStgbusj2baul2x15dHgPW7VX3LWA2MK9hhtfol8BB5drkZsA/lPolwGaSdivHXkPSVNsPAU9L6iz9mi6TFlcAH5S0Thlj85JkfwG8uyfhludyyxWMExERw2xIkqTtp4GTgJuBnwO3DXCI7wPv4/mlVoBjgM5yc8sSqmuPzbwbWFRulNkB+E6v9gXAcknzy2wN293Ao1RLvEB1PVTSGWXzYuB3VEnxbKrZLraforpe+VVJC6iWTF9T9jkcOFvSLVQz6aYXC21fXsa/UdJC4D+BibYXAp+lWspdQPVLwsY15xwRESNA1URubJG0KXANsF257jgUY060/XgpHw+sb/tfhmLsFVlzk228yaH/MdyHiYhoK4P97FZJ3bY7++o35v7EQNL7gZuA44cqQRYHlD/dWATsAXxhCMeOiIgWGJMzyVVJZ2enu7q6Wh1GRMSokplkRETEICVJRkRE1EiSjIiIqJEkGRERUSNJMiIiokaSZERERI38CcgoJ+kxYFmr46ixIfBQq4OokdhWXjvHl9hWTjvHBsMT38tsb9RXp3wJ8Oi3rD9/69MKkroS28C1c2zQ3vEltpXTzrFBa+PLcmtERESNJMmIiIgaSZKj35xWB7ACiW3ltHNs0N7xJbaV086xQQvjy407ERERNTKTjIiIqJEkOUpJeoukZZLulDRrhI55lqQHyteB9dStL+nnku4oP9cr9ZI0u8S3QNIuDfscWvrfIenQIYptC0lXS1oiabGkj7ZZfBMk3Vy+/HuxpM+W+q0k3VTi+L6kNUr9mmX7ztLe0TDWcaV+maQ3D0V8Zdzxkm6RdFk7xSbpHkkLy1fRdZW6dnldJ0u6WNJtkpZK2qONYtu2PGc9j0clHdtG8X2s/FtYJOmC8m+kLd5zL2A7j1H2AMYD/w1sDawBzAemjMBx9wZ2ARY11H0JmFXKs4AvlvK+wE8AAbsDN5X69YG7ys/1Snm9IYhtE2CXUl4XuB2Y0kbxCZhYyqtTfafp7sB/AgeX+jOAo0r5w8AZpXww8P1SnlJe7zWBrcr7YPwQvb7/DHwPuKxst0VswD3Ahr3q2uV1PRf4UCmvAUxul9h6xTke+CPwsnaID9gMuBtYq+G9dli7vOdeEOtQDpbHyDyovtT5iobt44DjRujYHbwwSS4DNinlTaj+bhPgm8B7evcD3gN8s6H+Bf2GMM4fAW9sx/iAtYHfAK+h+gPp1Xq/rsAVwB6lvFrpp96vdWO/Qca0OXAlMAO4rByrXWK7hxcnyZa/rsAkqv/o1W6xNYn1TcD17RIfVZK8lyrxrlbec29ul/dc4yPLraNTzxusx32lrhU2tv2HUv4jsHEp18U47LGXpZidqWZrbRNfWc68FXgA+DnVb70P236mybGei6O0PwJsMIzx/QfwSeDZsr1BG8Vm4GeSuiUdUera4XXdCngQOLssU39L0jptEltvBwMXlHLL47P9e+ArwO+AP1C9h7ppn/fcc5IkY8i4+lWupbdLS5oI/AA41vajjW2tjs/2ctvTqGZtuwHbtSqWRpL2Bx6w3d3qWGrsaXsX4K3A0ZL2bmxs4eu6GtXlh9Nt7ww8QbV82Q6xPadc1zsAuKh3W6viK9dB30b1i8amwDrAW0Y6jv5Ikhydfg9s0bC9ealrhfslbQJQfj5Q6utiHLbYJa1OlSDPt31Ju8XXw/bDwNVUy0mTJfV8PGTjsZ6Lo7RPAv40TPG9FjhA0j3AhVRLrl9vk9h6Zh3YfgC4lOoXjHZ4Xe8D7rN9U9m+mCpptkNsjd4K/Mb2/WW7HeLbB7jb9oO2nwYuoXoftsV7rlGS5Og0D9im3Am2BtVSytwWxTIX6Lnb7VCqa4E99e8vd8ztDjxSlniuAN4kab3y2+SbSt2gSBLwbWCp7a+2YXwbSZpcymtRXS9dSpUsZ9bE1xP3TOCq8lv/XODgcrffVsA2wM2Dic32cbY3t91B9V66yvYh7RCbpHUkrdtTpno9FtEGr6vtPwL3Stq2VL0BWNIOsfXyHp5fau2Jo9Xx/Q7YXdLa5d9uz3PX8vfciwzlBc48Ru5BdSfa7VTXtY4foWNeQHX94Gmq36I/SHVd4ErgDuAXwPqlr4BTS3wLgc6GcQ4H7iyPDwxRbHtSLRstAG4tj33bKL4dgVtKfIuAE0r91lT/qO+kWg5bs9RPKNt3lvatG8Y6vsS9DHjrEL/Gr+P5u1tbHluJYX55LO55r7fR6zoN6Cqv6w+p7v5si9jKuOtQzbgmNdS1RXzAZ4Hbyr+H71Ldodry91zvRz5xJyIiokaWWyMiImokSUZERNRIkoyIiKiRJBkREVEjSTIiIqJGkmRERESNJMmIiIgaSZIRERE1/gcth/LdJnrwmQAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df[\"education\"].value_counts().plot.barh()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将类别型数据转换成为数值型数据最直接的解决方法就是将此特征的每个值映射到一个唯一的数字。例如，可以将 university.degree 映射到 0 ，将 basic.9y 映射到 1，依此类推。 这里可以使用 `sklearn.preprocessing.LabelEncoder` 来执行此映射。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "label_encoder = LabelEncoder()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "该类的 `fit` 方法会查找所有一列特征中的所有类别并构建类别和数字之间的映射，用 `transform` 方法将类别转换为数字。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{0: 'basic.4y',\n",
       " 1: 'basic.6y',\n",
       " 2: 'basic.9y',\n",
       " 3: 'high.school',\n",
       " 4: 'illiterate',\n",
       " 5: 'professional.course',\n",
       " 6: 'university.degree',\n",
       " 7: 'unknown'}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAD8CAYAAABXe05zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAADz5JREFUeJzt3X+s3XV9x/Hna7cUpJCWH5UgJRYygqJuQG6YRGc2nAhq8B+SlewHOpdmm1tkMzEQk0X/c8tmdIlRG3/MbIo/EDbCVGSCcS5b9RaKtJRqxSrthFYNP40i+N4f53vxcrm399tyvr3nQ5+P5KTf8z1fvudFv6ev+7mf8z3fk6pCktSOX1vuAJKkg2NxS1JjLG5JaozFLUmNsbglqTEWtyQ1xuKWpMZY3JLUGItbkhqzYoidnnzyybV+/fohdi1Jz0lbtmz5UVWt7bPtIMW9fv16ZmZmhti1JD0nJfl+322dKpGkxljcktQYi1uSGmNxS1JjLG5JaozFLUmNsbglqTEWtyQ1pndxJ5lKckeSm4YMJEk6sIMZcb8N2DFUEElSP72KO8k64PXAR4aNI0laSt8R9/uAdwC/HDCLJKmHJYs7yRuAfVW1ZYntNiaZSTKzf//+sQWUJD1dnxH3K4DLkuwGPg1clORf529UVZuqarqqpteu7XVlQknSIViyuKvqmqpaV1XrgQ3ArVX1h4MnkyQtyPO4JakxB/VFClX1VeCrgySRJPXiiFuSGmNxS1JjLG5JaozFLUmNsbglqTEWtyQ1xuKWpMZY3JLUGItbkhpjcUtSYyxuSWqMxS1JjbG4JakxFrckNcbilqTGHNT1uPt6fO+j7Ln6v4bY9WDWvee3lzuCJPXiiFuSGmNxS1JjlizuJGcn2Trn9nCSqw5HOEnSMy05x11VO4FzAZJMAXuBGwbOJUlaxMFOlbwa+G5VfX+IMJKkpR1scW8Arh0iiCSpn97FnWQlcBnwuUUe35hkJsnMT3764LjySZLmOZgR96XA7VX1wEIPVtWmqpququkTj10znnSSpGc4mOK+AqdJJGnZ9SruJKuA1wDXDxtHkrSUXh95r6rHgJMGziJJ6sFPTkpSYyxuSWrMIFcHXHnacV5tT5IG4ohbkhpjcUtSYyxuSWqMxS1JjbG4JakxFrckNcbilqTGWNyS1BiLW5IaY3FLUmMsbklqjMUtSY2xuCWpMYNcHfCBe3fxj7//hiF2PXHe/pmbljuCpCOMI25JaozFLUmNWbK4k3wsyb4k2w5HIEnSgfUZcf8zcMnAOSRJPS1Z3FX1NeAnhyGLJKmHsc1xJ9mYZCbJzGM/f3xcu5UkzTO24q6qTVU1XVXTq45eOa7dSpLm8awSSWqMxS1JjelzOuC1wP8AZyfZk+Qtw8eSJC1myY+8V9UVhyOIJKkfp0okqTGDXGTqlDN/3YsvSdJAHHFLUmMsbklqjMUtSY2xuCWpMRa3JDXG4pakxljcktQYi1uSGmNxS1JjLG5JaozFLUmNsbglqTEWtyQ1ZpCrA+77/iN84M9uHWLXGshbP3TRckeQ1JMjbklqjMUtSY3pVdxJLkmyM8muJFcPHUqStLg+XxY8BXwAuBQ4B7giyTlDB5MkLazPiPsCYFdV3VtVjwOfBt44bCxJ0mL6FPdpwH1z7u/p1j1Nko1JZpLMPPqzB8eVT5I0z9jenKyqTVU1XVXTxx2zZly7lSTN06e49wKnz7m/rlsnSVoGfYr7m8BZSc5IshLYANw4bCxJ0mKW/ORkVT2R5C+Bm4Ep4GNVtX3wZJKkBfX6yHtVfQH4wsBZJEk9+MlJSWrMIBeZev4Lj/eiRZI0EEfcktQYi1uSGmNxS1JjLG5JaozFLUmNsbglqTEWtyQ1xuKWpMZY3JLUGItbkhpjcUtSYyxuSWqMxS1JjRnk6oA/27adHS968RC71nPEi+/ZsdwRpGY54pakxljcktSYXlMlSXYDjwBPAk9U1fSQoSRJizuYOe7fraofDZZEktSLUyWS1Ji+xV3Al5NsSbJxyECSpAPrO1Xyyqram+T5wC1J7qmqr83doCv0jQCnrhjkLENJEj1H3FW1t/tzH3ADcMEC22yqqumqmj5xyuKWpKEsWdxJViU5fnYZuBjYNnQwSdLC+gyNTwFuSDK7/aeq6kuDppIkLWrJ4q6qe4HfPAxZJEk9eDqgJDVmkHcRj3npS3jxzMwQu5akI54jbklqjMUtSY2xuCWpMRa3JDXG4pakxljcktQYi1uSGmNxS1JjLG5JaozFLUmNsbglqTEWtyQ1xuKWpMYMcnXA7T/ezss+8bIhdi0d0F1X3rXcEaTBOeKWpMZY3JLUmD5fFnx6ktuS3J1ke5K3HY5gkqSF9ZnjfgJ4e1Xd3n3b+5Ykt1TV3QNnkyQtYMkRd1X9sKpu75YfAXYApw0dTJK0sIOa406yHjgP2DxEGEnS0noXd5LjgM8DV1XVwws8vjHJTJKZJx95cpwZJUlz9CruJEcxKu1PVtX1C21TVZuqarqqpqeOnxpnRknSHH3OKgnwUWBHVb13+EiSpAPpM+J+BfBHwEVJtna31w2cS5K0iCVPB6yqrwM5DFkkST34yUlJaozFLUmNGeTqgC856SXMXDkzxK4l6YjniFuSGmNxS1JjLG5JaozFLUmNsbglqTEWtyQ1xuKWpMZY3JLUGItbkhpjcUtSYyxuSWqMxS1JjRnkIlP83x3wrtWD7FrShHnXQ8ud4IjjiFuSGmNxS1Jj+nxZ8DFJvpHkziTbk7z7cASTJC2szxz3z4GLqurRJEcBX0/yxar634GzSZIW0OfLggt4tLt7VHerIUNJkhbXa447yVSSrcA+4Jaq2jxsLEnSYnoVd1U9WVXnAuuAC5K8dP42STYmmUkys/+nDsglaSgHdVZJVT0I3AZcssBjm6pquqqm1x6bceWTJM3T56yStUnWdMvPA14D3DN0MEnSwvqcVXIq8IkkU4yK/rNVddOwsSRJi+lzVsm3gPMOQxZJUg9+clKSGmNxS1Jjhrk64AvOg3fNDLJrSTrSOeKWpMZY3JLUGItbkhpjcUtSYyxuSWqMxS1JjbG4JakxFrckNcbilqTGWNyS1BiLW5IaY3FLUmMGucjUXXsfYv3V/zHEriVpIu1+z+sP23M54pakxljcktQYi1uSGtOruJOsSXJdknuS7Ehy4dDBJEkL6/vm5PuBL1XV5UlWAscOmEmSdABLFneS1cCrgDcBVNXjwOPDxpIkLabPVMkZwH7g40nuSPKRJKvmb5RkY5KZJDNP/vShsQeVJI30Ke4VwPnAB6vqPOAx4Or5G1XVpqqarqrpqWNXjzmmJGlWn+LeA+ypqs3d/esYFbkkaRksWdxVdT9wX5Kzu1WvBu4eNJUkaVF9zyr5K+CT3Rkl9wJvHi6SJOlAehV3VW0FpgfOIknqwU9OSlJjBrk64MtOW83MYbxSliQdSRxxS1JjLG5JaozFLUmNsbglqTEWtyQ1xuKWpMakqsa/0+QRYOfYdzweJwM/Wu4QizDboZnkbDDZ+cx2aIbI9sKqWttnw0HO4wZ2VtVEftIyyYzZDp7ZDt0k5zPboVnubE6VSFJjLG5JasxQxb1poP2Og9kOjdkO3STnM9uhWdZsg7w5KUkajlMlktSYsRZ3kkuS7EyyK8kzvpdyCEk+lmRfkm1z1p2Y5JYk3+n+PKFbnyT/1OX7VpLz5/w3V3bbfyfJlWPKdnqS25LcnWR7krdNSr4kxyT5RpI7u2zv7tafkWRzl+Ez3ZdnkOTo7v6u7vH1c/Z1Tbd+Z5LXPttsc/Y71X1B9U0TmG13kruSbE0y061b9uPa7XNNkuuS3JNkR5ILJyFbkrO7v6/Z28NJrpqEbHP2+9fdv4dtSa7t/p1MzOvuKVU1lhswBXwXOBNYCdwJnDOu/R/geV/F6Dswt81Z9/fA1d3y1cDfdcuvA74IBHg5sLlbfyKjb/Y5ETihWz5hDNlOBc7vlo8Hvg2cMwn5uuc4rls+CtjcPedngQ3d+g8Bf94t/wXwoW55A/CZbvmc7lgfDZzRvQamxnRs/wb4FHBTd3+Ssu0GTp63btmPa7ffTwB/2i2vBNZMSrY5GaeA+4EXTko24DTge8Dz5rze3jRJr7unso7xQFwI3Dzn/jXANeMMe4DnXs/Ti3sncGq3fCqj88oBPgxcMX874Argw3PWP227Meb8d+A1k5YPOBa4HfgtRh8qWDH/mAI3Axd2yyu67TL/OM/d7llmWgd8BbgIuKl7ronI1u1rN88s7mU/rsBqRuWTScs2L8/FwH9PUjZGxX0fox8IK7rX3Wsn6XU3exvnVMns//SsPd265XBKVf2wW74fOKVbXizj4Nm7X6POYzSynYh83VTEVmAfcAujkcGDVfXEAs/zVIbu8YeAk4bKBrwPeAfwy+7+SROUDaCALyfZkmRjt24SjusZwH7g490000eSrJqQbHNtAK7tliciW1XtBf4B+AHwQ0avoy1M1usOOALenKzRj7xlPXUmyXHA54GrqurhuY8tZ76qerKqzmU0ur0AeNFy5JgvyRuAfVW1ZbmzHMArq+p84FLgrUleNffBZTyuKxhNHX6wqs4DHmM0/TAJ2QDo5ogvAz43/7HlzNbNrb+R0Q+/FwCrgEuWI8tSxlnce4HT59xf161bDg8kORWg+3Nft36xjINlT3IUo9L+ZFVdP2n5AKrqQeA2Rr8GrkkyeymEuc/zVIbu8dXAjwfK9grgsiS7gU8zmi55/4RkA54anVFV+4AbGP3gm4TjugfYU1Wbu/vXMSryScg261Lg9qp6oLs/Kdl+D/heVe2vql8A1zN6LU7M627WOIv7m8BZ3TuwKxn9KnTjGPd/MG4EZt9pvpLR3PLs+j/u3q1+OfBQ9yvazcDFSU7ofupe3K17VpIE+Ciwo6reO0n5kqxNsqZbfh6jufcdjAr88kWyzWa+HLi1Gx3dCGzo3mE/AzgL+MazyVZV11TVuqpaz+h1dGtV/cEkZANIsirJ8bPLjI7HNibguFbV/cB9Sc7uVr0auHsSss1xBb+aJpnNMAnZfgC8PMmx3b/d2b+7iXjdPc04J8wZvQv8bUZzpe8c574P8JzXMpqP+gWj0cZbGM0zfQX4DvCfwIndtgE+0OW7C5ies58/AXZ1tzePKdsrGf3a9y1ga3d73STkA34DuKPLtg342279mYxeZLsY/Sp7dLf+mO7+ru7xM+fs651d5p3ApWM+vr/Dr84qmYhsXY47u9v22df6JBzXbp/nAjPdsf03RmdeTEq2VYxGpavnrJuIbN1+3w3c0/2b+BdGZ4ZMxOtu7s1PTkpSY57zb05K0nONxS1JjbG4JakxFrckNcbilqTGWNyS1BiLW5IaY3FLUmP+H3jwbo8iXYE/AAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "mapped_education = pd.Series(label_encoder.fit_transform(df[\"education\"]))\n",
    "mapped_education.value_counts().plot.barh()\n",
    "dict(enumerate(label_encoder.classes_))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从上图可以看出，转换之后，会把类别型的特征都替换成了数值型特征。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>job</th>\n",
       "      <th>marital</th>\n",
       "      <th>education</th>\n",
       "      <th>default</th>\n",
       "      <th>housing</th>\n",
       "      <th>loan</th>\n",
       "      <th>contact</th>\n",
       "      <th>month</th>\n",
       "      <th>day_of_week</th>\n",
       "      <th>duration</th>\n",
       "      <th>campaign</th>\n",
       "      <th>pdays</th>\n",
       "      <th>previous</th>\n",
       "      <th>poutcome</th>\n",
       "      <th>emp.var.rate</th>\n",
       "      <th>cons.price.idx</th>\n",
       "      <th>cons.conf.idx</th>\n",
       "      <th>euribor3m</th>\n",
       "      <th>nr.employed</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>26</td>\n",
       "      <td>student</td>\n",
       "      <td>single</td>\n",
       "      <td>3</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>telephone</td>\n",
       "      <td>jun</td>\n",
       "      <td>mon</td>\n",
       "      <td>901</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.4</td>\n",
       "      <td>94.465</td>\n",
       "      <td>-41.8</td>\n",
       "      <td>4.961</td>\n",
       "      <td>5228.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>46</td>\n",
       "      <td>admin.</td>\n",
       "      <td>married</td>\n",
       "      <td>6</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>cellular</td>\n",
       "      <td>aug</td>\n",
       "      <td>tue</td>\n",
       "      <td>208</td>\n",
       "      <td>2</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.4</td>\n",
       "      <td>93.444</td>\n",
       "      <td>-36.1</td>\n",
       "      <td>4.963</td>\n",
       "      <td>5228.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>49</td>\n",
       "      <td>blue-collar</td>\n",
       "      <td>married</td>\n",
       "      <td>0</td>\n",
       "      <td>unknown</td>\n",
       "      <td>yes</td>\n",
       "      <td>yes</td>\n",
       "      <td>telephone</td>\n",
       "      <td>jun</td>\n",
       "      <td>tue</td>\n",
       "      <td>131</td>\n",
       "      <td>5</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.4</td>\n",
       "      <td>94.465</td>\n",
       "      <td>-41.8</td>\n",
       "      <td>4.864</td>\n",
       "      <td>5228.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>31</td>\n",
       "      <td>technician</td>\n",
       "      <td>married</td>\n",
       "      <td>6</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>cellular</td>\n",
       "      <td>jul</td>\n",
       "      <td>tue</td>\n",
       "      <td>404</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>-2.9</td>\n",
       "      <td>92.469</td>\n",
       "      <td>-33.6</td>\n",
       "      <td>1.044</td>\n",
       "      <td>5076.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>42</td>\n",
       "      <td>housemaid</td>\n",
       "      <td>married</td>\n",
       "      <td>6</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>telephone</td>\n",
       "      <td>nov</td>\n",
       "      <td>mon</td>\n",
       "      <td>85</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>-0.1</td>\n",
       "      <td>93.200</td>\n",
       "      <td>-42.0</td>\n",
       "      <td>4.191</td>\n",
       "      <td>5195.8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age          job  marital     ...      cons.conf.idx euribor3m nr.employed\n",
       "0   26      student   single     ...              -41.8     4.961      5228.1\n",
       "1   46       admin.  married     ...              -36.1     4.963      5228.1\n",
       "2   49  blue-collar  married     ...              -41.8     4.864      5228.1\n",
       "3   31   technician  married     ...              -33.6     1.044      5076.2\n",
       "4   42    housemaid  married     ...              -42.0     4.191      5195.8\n",
       "\n",
       "[5 rows x 20 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"education\"] = mapped_education\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用同样的方法转换数据集的其他类别型的特征。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>job</th>\n",
       "      <th>marital</th>\n",
       "      <th>education</th>\n",
       "      <th>default</th>\n",
       "      <th>housing</th>\n",
       "      <th>loan</th>\n",
       "      <th>contact</th>\n",
       "      <th>month</th>\n",
       "      <th>day_of_week</th>\n",
       "      <th>duration</th>\n",
       "      <th>campaign</th>\n",
       "      <th>pdays</th>\n",
       "      <th>previous</th>\n",
       "      <th>poutcome</th>\n",
       "      <th>emp.var.rate</th>\n",
       "      <th>cons.price.idx</th>\n",
       "      <th>cons.conf.idx</th>\n",
       "      <th>euribor3m</th>\n",
       "      <th>nr.employed</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>26</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>901</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1.4</td>\n",
       "      <td>94.465</td>\n",
       "      <td>-41.8</td>\n",
       "      <td>4.961</td>\n",
       "      <td>5228.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>46</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>208</td>\n",
       "      <td>2</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1.4</td>\n",
       "      <td>93.444</td>\n",
       "      <td>-36.1</td>\n",
       "      <td>4.963</td>\n",
       "      <td>5228.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>49</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>131</td>\n",
       "      <td>5</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1.4</td>\n",
       "      <td>94.465</td>\n",
       "      <td>-41.8</td>\n",
       "      <td>4.864</td>\n",
       "      <td>5228.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>31</td>\n",
       "      <td>9</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>404</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>-2.9</td>\n",
       "      <td>92.469</td>\n",
       "      <td>-33.6</td>\n",
       "      <td>1.044</td>\n",
       "      <td>5076.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>42</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>85</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>-0.1</td>\n",
       "      <td>93.200</td>\n",
       "      <td>-42.0</td>\n",
       "      <td>4.191</td>\n",
       "      <td>5195.8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age  job  marital     ...       cons.conf.idx  euribor3m  nr.employed\n",
       "0   26    8        2     ...               -41.8      4.961       5228.1\n",
       "1   46    0        1     ...               -36.1      4.963       5228.1\n",
       "2   49    1        1     ...               -41.8      4.864       5228.1\n",
       "3   31    9        1     ...               -33.6      1.044       5076.2\n",
       "4   42    3        1     ...               -42.0      4.191       5195.8\n",
       "\n",
       "[5 rows x 20 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "categorical_columns = df.columns[df.dtypes == \"object\"].union([\"education\"])\n",
    "for column in categorical_columns:\n",
    "    df[column] = label_encoder.fit_transform(df[column])\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这种方法存在一个问题，那就是会引入了一些可能不存在任何意义的相对排序。例如，在 job 这个特征的值中隐含地引入了代数，这可以从客户端 ＃1 的工作中减去客户端 ＃2 的工作："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-1.0"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[1].job - df.loc[2].job"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个操作有意义吗？显然没有， 现在使用转换后的特征来训练逻辑回归模型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.89      1.00      0.94      6128\n",
      "           1       0.62      0.01      0.02       771\n",
      "\n",
      "   micro avg       0.89      0.89      0.89      6899\n",
      "   macro avg       0.75      0.50      0.48      6899\n",
      "weighted avg       0.86      0.89      0.84      6899\n",
      "\n"
     ]
    }
   ],
   "source": [
    "def logistic_regression_accuracy_on(dataframe, labels):\n",
    "    features = dataframe.values\n",
    "    labels = np.array(labels)\n",
    "    train_features, test_features, train_labels, test_labels = train_test_split(\n",
    "        features, labels.ravel()\n",
    "    )\n",
    "\n",
    "    logit = LogisticRegression(max_iter=1000, solver=\"lbfgs\")\n",
    "    logit.fit(train_features, train_labels)\n",
    "    return classification_report(test_labels, logit.predict(test_features))\n",
    "\n",
    "\n",
    "print(logistic_regression_accuracy_on(df[categorical_columns], labels))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到 1 类的召回率为 0 或接近于 0，这意味着模型几乎把数据都分给了 0 类。为了避免这个问题，这里将使用另一种转换方法：独热编码。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 独热编码"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "独热编码又称为 One-Hot 编码，是用只含 0 和 1 来表示类别型特征的方法。假设某项特征含有三个类别值。独热编码会创建三个向量来表示这三个类别值，例如：[1,0,0]，[0,1,0]，[0,0,1]。来看一个例子。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0  1  2  3  4  5  6  7  8  9\n",
       "0  0  0  0  0  0  0  1  0  0  0"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "one_hot_example = pd.DataFrame([{i: 0 for i in range(10)}])\n",
    "one_hot_example.loc[0, 6] = 1\n",
    "one_hot_example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在使用 One-Hot 编码时，可以直接调用 `sklearn.preprocessing.OneHotEncoder` 接口。 默认情况下，One-Hot 将数据转换为稀疏矩阵以节省内存空间，因为大多数值都是零。 但是，在本实验这个特定的例子中，因为数据量比较少，所以没有遇到内存爆满的问题，因此这里使用「稠密」矩阵表示。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "onehot_encoder = OneHotEncoder(sparse=False, categories=\"auto\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>10</th>\n",
       "      <th>11</th>\n",
       "      <th>12</th>\n",
       "      <th>13</th>\n",
       "      <th>14</th>\n",
       "      <th>15</th>\n",
       "      <th>16</th>\n",
       "      <th>17</th>\n",
       "      <th>18</th>\n",
       "      <th>19</th>\n",
       "      <th>20</th>\n",
       "      <th>21</th>\n",
       "      <th>22</th>\n",
       "      <th>23</th>\n",
       "      <th>24</th>\n",
       "      <th>25</th>\n",
       "      <th>26</th>\n",
       "      <th>27</th>\n",
       "      <th>28</th>\n",
       "      <th>29</th>\n",
       "      <th>30</th>\n",
       "      <th>31</th>\n",
       "      <th>32</th>\n",
       "      <th>33</th>\n",
       "      <th>34</th>\n",
       "      <th>35</th>\n",
       "      <th>36</th>\n",
       "      <th>37</th>\n",
       "      <th>38</th>\n",
       "      <th>39</th>\n",
       "      <th>40</th>\n",
       "      <th>41</th>\n",
       "      <th>42</th>\n",
       "      <th>43</th>\n",
       "      <th>44</th>\n",
       "      <th>45</th>\n",
       "      <th>46</th>\n",
       "      <th>47</th>\n",
       "      <th>48</th>\n",
       "      <th>49</th>\n",
       "      <th>50</th>\n",
       "      <th>51</th>\n",
       "      <th>52</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0    1    2    3    4    5    6  ...    46   47   48   49   50   51   52\n",
       "0  0.0  1.0  0.0  1.0  0.0  0.0  0.0 ...   0.0  0.0  0.0  0.0  0.0  1.0  0.0\n",
       "1  1.0  0.0  0.0  0.0  0.0  1.0  0.0 ...   0.0  0.0  0.0  0.0  0.0  1.0  0.0\n",
       "2  0.0  1.0  0.0  0.0  0.0  1.0  0.0 ...   0.0  0.0  0.0  0.0  0.0  1.0  0.0\n",
       "3  1.0  0.0  0.0  0.0  0.0  1.0  0.0 ...   0.0  0.0  0.0  0.0  0.0  1.0  0.0\n",
       "4  0.0  1.0  0.0  1.0  0.0  0.0  0.0 ...   0.0  1.0  0.0  0.0  0.0  1.0  0.0\n",
       "\n",
       "[5 rows x 53 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encoded_categorical_columns = pd.DataFrame(\n",
    "    onehot_encoder.fit_transform(df[categorical_columns])\n",
    ")\n",
    "encoded_categorical_columns.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在进行 One-Hot 编码之后，得到 53 列数据，分别对应于原数据集类别特征的唯一值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.90      0.99      0.94      6099\n",
      "           1       0.61      0.17      0.26       800\n",
      "\n",
      "   micro avg       0.89      0.89      0.89      6899\n",
      "   macro avg       0.76      0.58      0.60      6899\n",
      "weighted avg       0.87      0.89      0.86      6899\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(logistic_regression_accuracy_on(encoded_categorical_columns, labels))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由上面的结果可知， 1 类的召回率得到了改善。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 哈希技巧"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在实际的工程应用中，真实数据可能是不稳定的，也就是说我们无法保证一些类别特征不会出现新的值。 此问题可能会导致训练好的模型无法使用。 除此之外，类别编码需要对整个数据集进行分析，并在内存中构建映射，这使得处理大型数据集变得尤为困难。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有一种基于哈希的类别编码方法，并且被称为哈希技巧。哈希函数将类别型特征编码为不同的特征值，例如："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "university.degree → -5370095693728667446\n",
      "high.school → -7042998680499890429\n",
      "illiterate → -7750457402342120656\n"
     ]
    }
   ],
   "source": [
    "for s in (\"university.degree\", \"high.school\", \"illiterate\"):\n",
    "    print(s, \"→\", hash(s))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一般情况下，在哈希函数中，我们不使用负数值以及比较大的数值，所以要将哈希值限定在一个范围空间。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "university.degree → 4\n",
      "high.school → 21\n",
      "illiterate → 19\n"
     ]
    }
   ],
   "source": [
    "hash_space = 25\n",
    "for s in (\"university.degree\", \"high.school\", \"illiterate\"):\n",
    "    print(s, \"→\", hash(s) % hash_space)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "哈希编码也可以创建类似于 One-Hot 编码的向量。可以看下面这个例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "job=student → 14\n",
      "marital=single → 21\n",
      "day_of_week=mon → 1\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>10</th>\n",
       "      <th>11</th>\n",
       "      <th>12</th>\n",
       "      <th>13</th>\n",
       "      <th>14</th>\n",
       "      <th>15</th>\n",
       "      <th>16</th>\n",
       "      <th>17</th>\n",
       "      <th>18</th>\n",
       "      <th>19</th>\n",
       "      <th>20</th>\n",
       "      <th>21</th>\n",
       "      <th>22</th>\n",
       "      <th>23</th>\n",
       "      <th>24</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0    1    2    3    4    5    6  ...    18   19   20   21   22   23   24\n",
       "0  0.0  1.0  0.0  0.0  0.0  0.0  0.0 ...   0.0  0.0  0.0  1.0  0.0  0.0  0.0\n",
       "\n",
       "[1 rows x 25 columns]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hashing_example = pd.DataFrame([{i: 0.0 for i in range(hash_space)}])\n",
    "for s in (\"job=student\", \"marital=single\", \"day_of_week=mon\"):\n",
    "    print(s, \"→\", hash(s) % hash_space)\n",
    "    hashing_example.loc[0, hash(s) % hash_space] = 1\n",
    "hashing_example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里需要指出的是，哈希编码不仅需要散列特征值，也需要散列「特征名称 + 特征值」对。 因为这样可以区分不同特征的相同值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert hash(\"no\") == hash(\"no\")\n",
    "assert hash(\"housing=no\") != hash(\"loan=no\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "使用哈希编码时是否可能发生冲突？ 当然，这是可能的。不过只要哈希空间足够大，这个问题可以避免。 但一般情况下，即使发生冲突，回归或分类指标也不会受到太大影响。 在这种情况下，哈希冲突可作为正则化的一种形式。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width='300px;' src=\"https://doc.shiyanlou.com/courses/uid214893-20190505-1557033384212\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "你可能在说：WTF，哈希似乎违反直觉。但事实上，有时这是唯一可行的处理类别数据的方法。 而且，这种技术已被证明是有效的。等你处理了足够多的数据之后，你可能自己意识到这一点。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 实验总结"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在本次实验中，我们主要讲述了随机梯度下降算法的原理以及它的优势。为解决大数据训练算法的问题，我们讲述了在线学习方法。为将类别型数据转化为数值型数据，讲述了 One-Hot 编码和哈希技巧。虽然在线学习在 scikit-learn 中实现的方法都很不错，但是在线学习还有一些其他的方法，你可以去了解 [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> Vowpal Wabbit</i>](https://github.com/JohnLangford/vowpal_wabbit/wiki)。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-link\" aria-hidden=\"true\"> 相关链接</i>\n",
    "- [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> 深度学习</i>](http://www.deeplearningbook.org/)\n",
    "- [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> VW 快速学习方法</i>](http://fastml.com/blog/categories/vw/)\n",
    "- [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> 了解实验楼《楼+ 机器学习和数据挖掘课程》</i>](https://www.shiyanlou.com/louplus/)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
